GPT4_Retrieval_Augmentation


Retrieval Augmentation for GPT-4 using Pinecone

Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone, and pass these to a GPT-4 model to generate answers backed by real data sources.

GPT-4 is a big step up from previous OpenAI completion models. It is also accessed exclusively through the ChatCompletion endpoint, so we must use it in a slightly different way than usual. However, the power of the model makes the change worthwhile, particularly when augmented with an external knowledge base like the Pinecone vector database.
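For reference, chat models take a list of role-tagged messages rather than a single prompt string. Here is a minimal sketch of such a call (assuming the pre-1.0 openai Python SDK used throughout this notebook, with your API key set as shown later):

import openai

openai.api_key = ""  # your key from platform.openai.com

res = openai.ChatCompletion.create(
    model="gpt-4",  # or "gpt-3.5-turbo" if you don't yet have GPT-4 access
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(res['choices'][0]['message']['content'])

We will build up to exactly this pattern later, with retrieved context packed into the user message.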

Required installs for this notebook are:

!pip install -qU bs4 tiktoken openai langchain pinecone-client[grpc]

Preparing the Data

In this example, we will download the LangChain docs from python.langchain.com. We get all .html files located on the site like so:

!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/

This downloads all HTML into the rtdocs directory. Now we can use LangChain itself to process these docs. We do this using the ReadTheDocsLoader like so:

from langchain.document_loaders import ReadTheDocsLoader  # loads the HTML files we just downloaded

loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)
This leaves us with hundreds of processed doc pages. Let's take a look at the format each one contains:
docs[0]
We access the plaintext page content like so:
print(docs[0].page_content)
print(docs[5].page_content)
We can also find the source of each document:
docs[5].metadata['source'].replace('rtdocs/', 'https://')
We can use these to create our `data` list:
data = []

for doc in docs:
    data.append({
        'url': doc.metadata['source'].replace('rtdocs/', 'https://'),
        'text': doc.page_content
    })
data[3]

It’s pretty ugly, but it’s good enough for now. Let’s see how we can process all of these docs. We will split everything into chunks of roughly 400 tokens, which we can do easily with LangChain and tiktoken:

import tiktoken


tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)
tokenizer

A quick note on what this does: tiktoken’s get_encoding function returns a tokenizer object for the named encoding, here p50k_base. The tiktoken_len function then encodes a piece of text with that tokenizer and returns the number of tokens produced. This token count is the length measure we will hand to the text splitter.
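As a quick sanity check (our own example, not in the text above), we can call the length function on a short string:

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

This returns the token count as an integer, which the splitter below uses to cap each chunk at roughly 400 tokens.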
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
Now we process the `data` into chunks using this splitter:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for idx, record in enumerate(tqdm(data)):
    texts = text_splitter.split_text(record['text'])
    chunks.extend([{
        'id': str(uuid4()),
        'text': texts[i],
        'chunk': i,
        'url': record['url']
    } for i in range(len(texts))])
chunks

Our chunks are ready, so now we move on to embedding and indexing everything.
Initializing the Embedding Model

We use OpenAI’s text-embedding-ada-002 as the embedding model. After initializing our OpenAI API key, we can embed a batch of texts like so (the resulting vectors can later be used to measure the similarity or distance between texts):

import openai

# initialize the OpenAI API key (from platform.openai.com)
openai.api_key = ""

embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)
In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.
res.keys()
Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions (the output dimensionality of the text-embedding-ada-002 model).

len(res['data'])
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])
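These vectors can be compared directly. As a brief aside (a sketch of our own using numpy, which isn’t otherwise used in this notebook), cosine similarity between the two embeddings is:

import numpy as np

a = np.array(res['data'][0]['embedding'])
b = np.array(res['data'][1]['embedding'])
# cosine similarity: 1.0 means identical direction, near 0 means unrelated
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Since ada-002 embeddings are normalized to unit length, this matches the dot-product metric we will give the Pinecone index below.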

We will apply this same embedding logic to the langchain docs dataset we’ve just scraped. But before doing so we must create a place to store the embeddings.

Initializing the Index

Now we need a place to store these embeddings and enable an efficient vector search through them all. For that we use Pinecone: we can get a free API key from the console, enter it below, and then initialize our connection to Pinecone and create a new index.

import pinecone

index_name = 'gpt-4-langchain-docs'

# initialize connection to pinecone
pinecone.init(
    api_key="",  # app.pinecone.io (console)
    environment=""  # next to API key in console
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='dotproduct'
    )
# connect to index
index = pinecone.GRPCIndex(index_name)
# view index stats
index.describe_index_stats()
We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI `text-embedding-ada-002` built embeddings like so:
from tqdm.auto import tqdm
import datetime
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoid RateLimitError)
    try:
        res = openai.Embedding.create(input=texts, engine=embed_model)
    except:
        done = False
        while not done:
            sleep(5)
            try:
                res = openai.Embedding.create(input=texts, engine=embed_model)
                done = True
            except:
                pass
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

Now we’ve added all of our langchain docs to the index. With that we can move on to retrieval and then answer generation using GPT-4.

Retrieval

To search through our documents we first need to create a query vector xq. Using xq we will retrieve the most relevant chunks from the LangChain docs, like so:

query = "how do I use the LLMChain in LangChain?"

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

xq = res['data'][0]['embedding']

# retrieve relevant contexts (including the question) from Pinecone
res = index.query(xq, top_k=5, include_metadata=True)
res
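Each match carries a similarity score and the metadata we upserted earlier, so we can quickly check which pages the contexts came from (a small inspection snippet of our own):

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['url']}")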
With retrieval complete, we move on to feeding these into GPT-4 to produce answers.

Retrieval Augmented Generation

GPT-4 is currently accessed via the ChatCompletion endpoint of OpenAI. To add the information we retrieved into the model, we need to pass it into our user prompts alongside our original query. We can do that like so:

# get list of retrieved text
contexts = [item['metadata']['text'] for item in res['matches']]

augmented_query = "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query
print(augmented_query)
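Before sending this to the model, it’s worth checking the prompt size. We can reuse the tiktoken_len function from earlier (an approximation: chat models meter tokens with the cl100k_base encoding rather than p50k_base):

print(tiktoken_len(augmented_query))  # should fit comfortably within the model's context window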
Now we ask the question:
# system message to 'prime' the model
primer = f"""You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""

res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # swap in "gpt-4" here if your account has access
    messages=[
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)
To display this response nicely, we render it as Markdown.
from IPython.display import Markdown

display(Markdown(res['choices'][0]['message']['content']))
Let's compare this to a non-augmented query...
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))
And what if we drop the `"I don't know"` instruction from the `primer`?
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are Q&A bot. A highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))
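To make experiments like these easy to repeat, we can collapse the whole retrieve-then-answer flow into a single helper. This is a sketch assembled from the calls above; the name retrieve_and_answer is our own, not part of any library:

def retrieve_and_answer(query, top_k=5, model="gpt-3.5-turbo"):
    # embed the query with the same model used for the documents
    res = openai.Embedding.create(input=[query], engine=embed_model)
    xq = res['data'][0]['embedding']
    # pull the most relevant chunks from Pinecone
    matches = index.query(xq, top_k=top_k, include_metadata=True)['matches']
    contexts = [m['metadata']['text'] for m in matches]
    # pack the contexts above the question, exactly as we did manually
    augmented = "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + query
    chat = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": primer},
            {"role": "user", "content": augmented}
        ]
    )
    return chat['choices'][0]['message']['content']

display(Markdown(retrieve_and_answer("how do I use the LLMChain in LangChain?")))

This gives us a single entry point for asking retrieval-augmented questions against the LangChain docs.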
