[ChatGPT] Using a PDF File as a Knowledge Source


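Overview

Answer questions from a designated knowledge source. This works well for specialized domains inside a company, where the answers must come from internal documents.
The example below uses the file 2023_GPT4All_Technical_Report.pdf as the knowledge source and answers questions about it.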

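Details:

Load the PDF file and read its text content.
Split the content into chunks and pass them to OpenAI embeddings, which builds the mapping between a "door number" (the vector) and the "room" (the actual piece of knowledge).
Use FAISS (short for Facebook AI Similarity Search) to search for the chunks most relevant to the question.
Pass the question and the retrieved chunks to OpenAI, which composes the final answer.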
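
The retrieval steps above can be sketched without any API calls. The following is a minimal illustration, not the method used by OpenAI or FAISS: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the ranking is a brute-force cosine comparison rather than a FAISS index, and the sample chunks and the prompt wording in `build_prompt` are made up for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Put the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up sample chunks, standing in for pieces of a PDF.
chunks = [
    "pdf pages are read as plain text",
    "faiss builds an index over embedding vectors",
    "the language model writes the final answer",
]
query = "faiss index vectors"
top = search(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In the real pipeline the vectors come from the OpenAI embeddings API and the prompt is sent to the language model; here the prompt is only printed so the flow of data is visible.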

Prerequisites

pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken

Code

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

import os
# Replace the placeholder with your own key; never publish a real key in source code.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"


# Path to the PDF file.
reader = PdfReader('/Users/yutao/Downloads/2023_GPT4All_Technical_Report.pdf')

# Extract the text of every page and concatenate it into raw_text.
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

# raw_text
# raw_text[:100]

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

print(len(texts))
# print(texts[0])

# Create the OpenAI embeddings client; the chunks are embedded through the OpenAI API.
embeddings = OpenAIEmbeddings()

# FAISS is short for Facebook AI Similarity Search,
# an index structure for efficient search over embedding vectors.
# https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt#using-faiss-for-efficient-similarity-search
docsearch = FAISS.from_texts(texts, embeddings)

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(), chain_type="stuff")

query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
# Pass the retrieved chunks and the question to OpenAI to generate the answer.
answer = chain.run(input_documents=docs, question=query)
print("---------")
# print(docs)
print(answer)

# Note: embeddings map the text chunks into a vector space, where relevance can be computed.

query = "What was the cost of training the GPT4all model?"
docs = docsearch.similarity_search(query)
answer = chain.run(input_documents=docs, question=query)
print(answer)
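
To see what the chunk_size and chunk_overlap settings in the code above are for, here is a simplified sketch. It is not how CharacterTextSplitter is implemented (that class splits on the separator first and then merges pieces up to the size limit); the hypothetical helper `split_with_overlap` uses a plain fixed-size sliding window over characters, which is enough to show how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that falls on a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.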

Reference:

https://colab.research.google.com/drive/181BSOH6KF_1o2lFG8DQ6eJd2MZyiSBNt?usp=sharing#scrollTo=2VXlucKiW7bX
