Hand-Rolled GPT Series: Building a Book-Reading Bot with ChatGPT + langchain

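ChatGPT is famous by now, but building applications on top of large language models (LLMs) is still at a very early, exploratory stage, and new LLM-based application techniques keep appearing. This article introduces an application framework for LLMs: langchain. langchain bundles most of the building blocks an LLM application needs. Readers familiar with Java web development will know the Spring Boot framework; langchain can be thought of as the Spring Boot of LLM applications. This article gives LLM application developers a langchain-based sample project, to help make their prompt engineering more efficient.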

1. What this demo implements

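The sample project implements a bot that reads e-books (in EPUB format) from a given path and answers user questions based on what it has read. This is the book-reading bot of the title.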

2. Preparing the development environment

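Install Python 3.8 or later; the latest version at the time of writing is 3.11.
This example uses jupyter-lab as the development environment, so jupyter-lab must be installed on your machine.
Register an OpenAI account and set the OPENAI_API_KEY environment variable.
We use Redis to store the contents of the loaded books, so a Redis service must be deployed. Unlike the plain Redis service used by typical web applications, this time we need redis-stack, which includes the search module used for vector indexes. The command below maps the Redis port 6379 to host port 10001, and the bundled RedisInsight UI on port 8001 to host port 13333: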
```
docker run -d -p 13333:8001 -p 10001:6379 redis/redis-stack:latest
```
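
Before moving on, it can help to confirm that the settings from the steps above are in place. The following is a minimal sketch, not part of the original project: the helper names `missing_settings` and `redis_host_port` are hypothetical, and the URL `redis://localhost:10001` assumes the host port chosen in the docker command above.

```python
import os
from urllib.parse import urlparse


def missing_settings(env, required=("OPENAI_API_KEY",)):
    """Return the names of required variables that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def redis_host_port(redis_url):
    """Split a redis:// URL into (host, port), falling back to Redis's default 6379."""
    parsed = urlparse(redis_url)
    return parsed.hostname, parsed.port or 6379


# Report anything that still needs to be configured in the current shell.
print("missing settings:", missing_settings(os.environ))
print("redis endpoint:", redis_host_port("redis://localhost:10001"))
```

An empty list of missing settings means the OpenAI key is visible to the notebook process; the check does not verify that the key is valid or that Redis is actually reachable.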

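Then install the required Python packages. The import paths used in this article come from the early langchain 0.0.x releases, so a newer release may need adjusted imports.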

pip install openai
pip install langchain
pip install redis
pip install unstructured

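Install pandoc, which is needed to load EPUB e-books.
Prepare the e-books: in the project directory (the directory that contains your ipynb file), create a resources/epub directory and put the EPUB files there. For convenience, the author has made the e-books used in this example available for download.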

3. Importing the required packages

import os

# Only the names actually used in this walkthrough are imported.
from langchain import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis
from langchain.document_loaders import UnstructuredEPubLoader
from langchain.agents.agent_toolkits import (
    create_vectorstore_agent,
    VectorStoreToolkit,
    VectorStoreRouterToolkit,
    VectorStoreInfo,
)

4. Loading the EPUB contents

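List all e-books under resources/epub, then read them into the documents dictionary. Note that the code below keeps only text whose category is NarrativeText; the other text categories include Title, UncategorizedText, ListItem and so on.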

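book_dir = 'resources/epub'
data = {}       # file name -> every element loaded from that book
documents = {}  # file name -> only the NarrativeText elements

# Load each EPUB file in the directory; mode='elements' yields one
# document per text block, with its category stored in the metadata.
for f in os.listdir(book_dir):
    path = os.path.join(book_dir, f)
    if os.path.isfile(path):
        print(path)
        loader = UnstructuredEPubLoader(path, mode='elements')
        data[f] = loader.load()

# Keep narrative paragraphs only; titles, list items etc. are dropped.
for book in data:
    documents[book] = [
        seg for seg in data[book]
        if seg.metadata.get('category') == 'NarrativeText'
    ]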

resources/epub/California - Sara Benson.epub
resources/epub/LP_台湾_en.epub
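
Before committing to NarrativeText alone, it can be useful to see how the loaded elements are distributed across categories. The sketch below illustrates the idea on hand-made data: `Element` is a hypothetical stand-in for the objects returned by the loader (it only mimics their `page_content` and `metadata` attributes), and `category_counts` and `keep_categories` are illustrative helper names, not langchain functions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one loaded element (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def category_counts(elements):
    """Count elements per metadata category; missing categories count as 'Unknown'."""
    return Counter(e.metadata.get("category", "Unknown") for e in elements)


def keep_categories(elements, wanted=("NarrativeText",)):
    """Return only the elements whose category is in `wanted`."""
    return [e for e in elements if e.metadata.get("category") in wanted]


sample = [
    Element("Sun Moon Lake", {"category": "Title"}),
    Element("The lake is the largest body of water in Taiwan.", {"category": "NarrativeText"}),
    Element("Hot springs", {"category": "ListItem"}),
    Element("Trails and cycling paths circle the lake.", {"category": "NarrativeText"}),
]

print(category_counts(sample))
print([e.page_content for e in keep_categories(sample)])
```

If a book turns out to keep most of its useful text in ListItem elements, the `wanted` tuple can be widened instead of hard-coding a single category.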

5. Storing the data in Redis as embedding vectors

Create one index per book, using the book's file name as the index name.

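redis_url = 'redis://localhost:10001'  # host port mapped to Redis in the docker command
embeddings = OpenAIEmbeddings()
# Embed each book's segments and write them to an index named after the file.
for book in documents:
    rds = Redis.from_documents(documents[book], embeddings, redis_url=redis_url, index_name=book)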

6. Trying a similarity search

rds = Redis.from_existing_index(embeddings, redis_url=redis_url, index_name='LP_台湾_en.epub')
query = '台湾日月潭'  # "Sun Moon Lake, Taiwan"
rds.similarity_search(query)

[Document(page_content='鯉魚潭;\r\nLǐyú Tán), a pretty willow-lined pond with a lush green mountain\r\nbackdrop, you’ll find hot springs', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'}),
 Document(page_content='(明月溫泉 Míngyuè Wēnquán;  2661 7678; www.fullmoonspa.net; 1 Lane\r\n85, Wulai St; unlimited time public pools NT$490) One of\r\nthe more stylish hotels along the tourist street, Full Moon has mixed\r\nand nude segregated pools with nice views over the Tongshi River. Its\r\nprivate rooms feature wooden tubs. The hotel also offers rooms for\r\novernight stays from NT$2700. Go for the lower cheaper rooms as the\r\nviews are surprisingly better than higher up.', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'}),
 Document(page_content='7 Sun Moon Lake (Click\r\nhere) is the largest body of water in Taiwan and boasts a\r\nwatercolour background ever changing with the season and light. Although\r\nthe area is packed with Chinese tourists these days it’s still\r\nremarkably easy to get away from the crowds on the many trails and\r\ncycling paths. Loop down to the old train depot at Checheng to explore\r\n1950s Taiwan, or head to Shuili to see the last working snake kiln. No\r\nmatter what, don’t miss the region’s high-mountain oolong tea: it’s some\r\nof the finest in the world.', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'}),
 Document(page_content='(鯉魚潭露營區 Lǐyú Tán Lùyíng qū;  03-865 5678; per site NT$800) The\r\ncampground is 1km south of the lake off Hwy 9 and features showers,\r\nbarbecue areas and covered sites.', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'})]
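
A similarity search of this kind ranks stored segments by how close their embedding vectors are to the embedding of the query. The sketch below shows the ranking idea with cosine similarity on made-up three-dimensional vectors. It is an illustration only: real OpenAI embeddings have far more dimensions, the distance metric Redis actually uses depends on how the index was created, and the names `toy_index` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors standing in for embedded text segments.
toy_index = {
    "sun_moon_lake": [0.9, 0.1, 0.0],
    "liyu_lake": [0.7, 0.3, 0.1],
    "wulai_hot_spring": [0.1, 0.9, 0.2],
}

print(top_k([1.0, 0.0, 0.0], toy_index))
```

As the real results above show, the nearest segments are not always the ones a human would pick first, which is one reason the agent later passes several retrieved segments to the LLM rather than only the top hit.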

7. Creating a vector_store_agent

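VectorStoreRouterToolkit takes several books as input and switches to the most suitable book for each user question. The alternative toolkit is VectorStoreToolkit(vectorstore_info=vectorstore_info). The idea behind that toolkit is to store several books under a single index; the bot then draws on the relevant content of all the books when answering, and if the user asks for sources, it extracts the 'source' field from the metadata and includes it in the reply.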

llm = OpenAI(temperature=0)
rdss = {}
infos=[]
for book in documents.keys():
    rdss[book] = Redis.from_existing_index(embeddings, redis_url=redis_url, index_name=book)
    vectorstore_info = VectorStoreInfo(
        name="hotest_travel_advice_about_"+ book,
        description="the best travel advice about " + book,
        vectorstore=rdss[book]
    )
    infos.append(vectorstore_info)

# VectorStoreRouterToolkit takes several books as input and routes each question to the most suitable one.
toolkit = VectorStoreRouterToolkit(vectorstores=infos, llm=llm)
agent_executor = create_vectorstore_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True)

8. Inspecting what the prompt actually looks like

print(agent_executor.agent.llm_chain.prompt)

input_variables=['input', 'agent_scratchpad'] output_parser=None partial_variables={} template='You are an agent designed to answer questions about sets of documents.\nYou have access to tools for interacting with the documents, and the inputs to the tools are questions.\nSometimes, you will be asked to provide sources for your questions, in which case you should use the appropriate tool to do so.\nIf the question does not seem relevant to any of the tools provided, just return "I don\'t know" as the answer.\n\n\nhotest_travel_advice_about_California - Sara Benson.epub: Useful for when you need to answer questions about hotest_travel_advice_about_California - Sara Benson.epub. Whenever you need information about the best travel advice about California - Sara Benson.epub you should ALWAYS use this. Input should be a fully formed question.\nhotest_travel_advice_about_LP_台湾_en.epub: Useful for when you need to answer questions about hotest_travel_advice_about_LP_台湾_en.epub. Whenever you need information about the best travel advice about LP_台湾_en.epub you should ALWAYS use this. Input should be a fully formed question.\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [hotest_travel_advice_about_California - Sara Benson.epub, hotest_travel_advice_about_LP_台湾_en.epub]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: {input}\nThought:{agent_scratchpad}' template_format='f-string' validate_template=True
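
The printed template has two placeholders, `input` and `agent_scratchpad`, and its `template_format` is 'f-string'. The sketch below shows how such a template is filled in, using a heavily shortened copy of the template and plain `str.format`. It assumes, as a simplification of what the agent does, that the scratchpad accumulates the earlier Thought/Action/Observation text on each round; `render_prompt` and `some_tool` are hypothetical names.

```python
# Shortened copy of the template printed above; the tool list is omitted.
TEMPLATE = (
    "You are an agent designed to answer questions about sets of documents.\n"
    "\n"
    "Question: {input}\n"
    "Thought:{agent_scratchpad}"
)


def render_prompt(question, scratchpad=""):
    """Fill the two placeholders the same way an f-string style template would."""
    return TEMPLATE.format(input=question, agent_scratchpad=scratchpad)


question = "When is the best time to visit Sun Moon Lake?"

# First round: nothing has happened yet, so the scratchpad is empty.
first_call = render_prompt(question)

# Later round (assumed shape): earlier steps are appended after "Thought:".
later_call = render_prompt(
    question,
    scratchpad=(
        " I should look this up\n"
        "Action: some_tool\n"
        "Action Input: best time to visit\n"
        "Observation: autumn\n"
        "Thought:"
    ),
)

print(first_call)
print("-----")
print(later_call)
```

Because the prompt always ends with an open "Thought:", the model's continuation is expected to start with its next reasoning step.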

9. That's it: you can now ask questions about the books

Note that the bot automatically picks the book most relevant to your question and answers from it.

# The question: "When is the best time to visit Sun Moon Lake? Please answer in Chinese."
resp = agent_executor.run("日月潭什么时候去旅游比较好，请用中文回答")
print(resp)

> Entering new AgentExecutor chain...
 I should use hotest_travel_advice_about_LP_台湾_en.epub to answer this question
Action: hotest_travel_advice_about_LP_台湾_en.epub
Action Input: 日月潭什么时候去旅游比较好
Observation:  The best time to visit Sun Moon Lake is during autumn and early spring (October to December and March to April). May has seasonal monsoon rains, and typhoons are a problem from June to September, though if there is no typhoon, you can certainly visit.
Thought: I now know the final answer
Final Answer: 日月潭最好的旅游时间是秋季和初春（10月到12月和3月到4月）。五月有季节性的季风雨，6月到9月有台风，但是如果没有台风，你也可以去旅游。

> Finished chain.
日月潭最好的旅游时间是秋季和初春（10月到12月和3月到4月）。五月有季节性的季风雨，6月到9月有台风，但是如果没有台风，你也可以去旅游。
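
The trace above follows the Thought / Action / Action Input / Final Answer format requested in the prompt. To make the loop concrete, here is a simplified sketch of how one round of model output could be parsed into either a tool call or a final answer. It is an illustration of the format, not langchain's actual parser; `parse_step` and the two regular expressions are hypothetical.

```python
import re

# "Action: <tool>" on one line, followed by "Action Input: <text>".
ACTION_RE = re.compile(r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def parse_step(llm_text):
    """Classify one model reply as a finished answer or a tool call."""
    final = FINAL_RE.search(llm_text)
    if final:
        return ("finish", final.group(1).strip())
    action = ACTION_RE.search(llm_text)
    if action:
        return ("call_tool", action.group(1).strip(), action.group(2).strip())
    raise ValueError("reply matches neither the Action nor the Final Answer format")


# Replies copied from the trace above.
step_1 = (
    " I should use hotest_travel_advice_about_LP_台湾_en.epub to answer this question\n"
    "Action: hotest_travel_advice_about_LP_台湾_en.epub\n"
    "Action Input: 日月潭什么时候去旅游比较好"
)
step_2 = " I now know the final answer\nFinal Answer: 日月潭最好的旅游时间是秋季和初春"

print(parse_step(step_1))
print(parse_step(step_2))
```

In the real run, a tool call leads to a vector-store lookup whose result is fed back as the Observation, and the loop repeats until a Final Answer appears.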

Closing remarks

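Because this is only a sample project, it is implemented entirely in jupyter lab and is not a real project that can provide any web service, and some of its logic is simplified. For example, it can only read EPUB e-books that the author has placed in the project directory in advance; users cannot upload their own e-books and get answers about them. As another example, running the e-book loading step twice in a row leaves duplicate data in the database, because this example contains no deduplication logic. The main purpose of the project is to explore the role of langchain in LLM application development, not to implement a commercially viable bot in exhaustive detail. In the author's view, adding a lot of business logic for such details would bury the most important part of the project under non-core code and make it harder for readers to form a clear picture of langchain. Readers who reuse this code in their own projects should therefore take care to fill in these details, to avoid defects.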
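
One way to address the duplicate-data issue mentioned above is to fingerprint each segment and skip segments that have been stored before. The following is a minimal sketch under stated assumptions: segments are represented as hypothetical `(source, text)` pairs, the set of seen keys lives in memory (a real project would need to persist it, for example in Redis), and `content_key` and `new_segments` are illustrative names rather than part of langchain.

```python
import hashlib


def content_key(source, text):
    """Stable fingerprint of one segment, derived from its source file and text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def new_segments(segments, seen_keys):
    """Return segments not seen before and record their keys in `seen_keys`."""
    fresh = []
    for source, text in segments:
        key = content_key(source, text)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append((source, text))
    return fresh


seen = set()
batch = [
    ("book.epub", "First paragraph."),
    ("book.epub", "Second paragraph."),
]

first_run = new_segments(batch, seen)   # everything is new
second_run = new_segments(batch, seen)  # the same batch again: nothing is new
print(len(first_run), len(second_run))  # prints "2 0"
```

Only the segments returned by `new_segments` would then be embedded and written to the index, which also avoids paying for embeddings of text that is already stored.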