Retrieval Augmentation for GPT-4 using Pinecone
Fixing LLMs that Hallucinate
In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone, and pass these to a GPT-4 model to generate answers backed by real data sources.
GPT-4 is a big step up from previous OpenAI completion models. It is also accessed exclusively through the ChatCompletion
endpoint, so we must use it in a slightly different way than usual. However, the power of the model makes the change worthwhile, particularly when augmented with an external knowledge base like the Pinecone vector database.
Required installs for this notebook are:
!pip install -qU bs4 tiktoken openai langchain pinecone-client[grpc]
Preparing the Data
In this example, we will download the LangChain docs from langchain.readthedocs.io/. We get all .html
files located on the site like so:
!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
This downloads all HTML into the rtdocs
directory. Now we can use LangChain itself to process these docs. We do this using the ReadTheDocsLoader
like so:
from langchain.document_loaders import ReadTheDocsLoader  # module for loading the downloaded docs
loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)
This leaves us with hundreds of processed doc pages. Let's take a look at the format each one contains:
docs[0]
We access the plaintext page content like so:
print(docs[0].page_content)
print(docs[5].page_content)
We can also find the source of each document:
docs[5].metadata['source'].replace('rtdocs/', 'https://')
We can use these to create our `data` list:
data = []

for doc in docs:
    data.append({
        'url': doc.metadata['source'].replace('rtdocs/', 'https://'),
        'text': doc.page_content
    })
data[3]
It's pretty ugly, but it's good enough for now. Let's see how we can process all of these. We will chunk everything into ~400-token chunks, which we can do easily with langchain and tiktoken:
import tiktoken

tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)
This cell uses tiktoken's get_encoding function to obtain a tokenizer for the p50k_base encoding. The tiktoken_len function encodes a piece of text with that tokenizer and returns the number of tokens, which we will use as the length function when chunking the text.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
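Before processing the full dataset, we can sanity-check the splitter on a single document and confirm the chunks stay around our 400-token target. This is a quick check using the objects defined above (not part of the original notebook); the exact counts will depend on the scraped docs:
# split one document and inspect the token length of each chunk
sample_chunks = text_splitter.split_text(data[3]['text'])
print(len(sample_chunks))
print([tiktoken_len(chunk) for chunk in sample_chunks])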
Now we process the `data` list into chunks using this approach:
from uuid import uuid4
from tqdm.auto import tqdm
chunks = []

for idx, record in enumerate(tqdm(data)):
    texts = text_splitter.split_text(record['text'])
    chunks.extend([{
        'id': str(uuid4()),
        'text': texts[i],
        'chunk': i,
        'url': record['url']
    } for i in range(len(texts))])
chunks
Our chunks are ready so now we move onto embedding and indexing everything.
Initialize Embedding Model
We use text-embedding-ada-002
as the embedding model. We first initialize our OpenAI API key, then call the Embedding API with a couple of sample texts; the resulting embedding vectors can later be used to measure similarity between texts. We can embed text like so:
import openai

# initialize openai API key
openai.api_key = ""  # platform.openai.com

embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)
In the response res
we will find a JSON-like object containing our new embeddings within the 'data'
field.
res.keys()
Inside 'data'
we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536
dimensions (the output dimensionality of the text-embedding-ada-002
model).
len(res['data'])
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])
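These vectors can be compared to measure how similar two texts are. As a quick illustration, we can compute the cosine similarity between the two sample embeddings. This is a minimal sketch (not part of the original notebook) and assumes numpy is available, even though it is not in the install list above:
import numpy as np

# cosine similarity between the two sample embeddings
a = np.array(res['data'][0]['embedding'])
b = np.array(res['data'][1]['embedding'])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)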
We will apply this same embedding logic to the langchain docs dataset we’ve just scraped. But before doing so we must create a place to store the embeddings.
Initializing the Index
Now we need a place to store these embeddings and enable an efficient vector search through them all. To do that we use Pinecone. We can get a free API key and enter it below, where we initialize our connection to Pinecone and create a new index.
import pinecone

index_name = 'gpt-4-langchain-docs'
# index_name = "langchain-demo"

# initialize connection to pinecone
pinecone.init(
    api_key="",  # app.pinecone.io (console)
    environment=""  # next to API key in console
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='dotproduct'
    )

# connect to index
index = pinecone.GRPCIndex(index_name)
# view index stats
index.describe_index_stats()

pinecone.list_indexes()
We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI `text-embedding-ada-002` built embeddings like so:
from tqdm.auto import tqdm
import datetime
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoid RateLimitError)
    try:
        res = openai.Embedding.create(input=texts, engine=embed_model)
    except:
        done = False
        while not done:
            sleep(5)
            try:
                res = openai.Embedding.create(input=texts, engine=embed_model)
                done = True
            except:
                pass
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
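Once the loop finishes, we can check the index stats again; the `total_vector_count` should now match the number of chunks we created. This is a quick sanity check, not part of the original flow:
# confirm the upserts landed: the vector count should match the number of chunks
print(len(chunks))
index.describe_index_stats()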
Now we’ve added all of our langchain docs to the index. With that we can move on to retrieval and then answer generation using GPT-4.
Retrieval
To search through our documents we first need to create a query vector xq
. Using xq
we will retrieve the most relevant chunks from the LangChain docs, like so:
query = "how do I use the LLMChain in LangChain?"
res = openai.Embedding.create(
input=[query],
engine=embed_model
)
# retrieve from Pinecone
xq = res['data'][0]['embedding']
get relevant contexts (including the questions)
res = index.query(xq, top_k=5, include_metadata=True)
res
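Each match carries the metadata we attached at upsert time, so we can quickly see where each retrieved chunk came from. The following is a small sketch (not part of the original notebook) using the same dict-style access on the matches that we use in the next step:
# inspect the source URL and a snippet of each retrieved chunk
for match in res['matches']:
    print(match['metadata']['url'])
    print(match['metadata']['text'][:200], '\n')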
With retrieval complete, we move on to feeding these into GPT-4 to produce answers.
Retrieval Augmented Generation
GPT-4 is currently accessed via the ChatCompletions
endpoint of OpenAI. To add the information we retrieved into the model, we need to pass it into our user prompts alongside our original query. We can do that like so:
# get list of retrieved text
contexts = [item['metadata']['text'] for item in res['matches']]
augmented_query = "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query
print(augmented_query)
Now we ask the question:
# system message to 'prime' the model
primer = f"""You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""
res = openai.ChatCompletion.create(
    # model="gpt-4",
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)
To display this response nicely, we will render it as markdown.
from IPython.display import Markdown
display(Markdown(res['choices'][0]['message']['content']))
Let's compare this to a non-augmented query...
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))
What if we drop the `"I don't know"` part of the `primer`?
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are Q&A bot. A highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))