Asking Claude / GPT for Help | PyPDF2, Tested and Working


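Copy the most widely circulated code online for extracting text from a PDF, and nine times out of ten it raises an error!

The circulating code targets old versions of PyPDF2; the method names and usage changed in PyPDF2 3.0.0.

Common errors:

1. The call for reading a PDF document changed:

PdfFileReader has been removed and no longer works;

use PyPDF2.PdfReader(read_pdf) instead.

Error message: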

PyPDF2.errors.DeprecationError: 
PdfFileReader is deprecated and 
was removed in PyPDF2 3.0.0. 
Use PdfReader instead.

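2. The call for fetching a page of the PDF document changed:

reader.getPage(page_number) has been removed and no longer works;

use reader.pages[page_number] instead.

Error message: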

PyPDF2.errors.DeprecationError: 
reader.getPage(pageNumber) is 
deprecated and was removed in 
PyPDF2 3.0.0. 
Use reader.pages[page_number] instead.
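
As a minimal sketch of the two changes above, the PyPDF2 3.x style looks roughly like this. The helper names extract_all_text and pdf_to_text are hypothetical and not part of PyPDF2; the only library calls used are PyPDF2.PdfReader, reader.pages, and page.extract_text(). Joining pages with a newline is a choice made for this sketch.

```python
def extract_all_text(reader):
    """Join the text of every page of a PdfReader-like object.

    `reader` only needs a `pages` sequence whose items have `extract_text()`,
    which is what PyPDF2.PdfReader provides in 3.x.
    """
    parts = []
    for page in reader.pages:  # replaces reader.getPage(i)
        # Fall back to "" if a page yields no text (e.g. an image-only page)
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def pdf_to_text(path):
    """Open the PDF at `path` and return its text (assumes PyPDF2 >= 3.0)."""
    import PyPDF2  # imported lazily so extract_all_text works without it

    with open(path, "rb") as fh:
        reader = PyPDF2.PdfReader(fh)  # replaces PyPDF2.PdfFileReader(fh)
        return extract_all_text(reader)
```

Calling pdf_to_text("some.pdf") on a real file would return the concatenated page text; the path here is only a placeholder.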

3. Encoding error

location.writelines(extract_text)
UnicodeEncodeError: 'cp950' codec can't encode character
'\u2212' in position 0:
illegal multibyte sequence

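Fixes suggested by Claude / GPT

The error means that while writing the extracted text, Python hit a character that the specified encoding, 'cp950', cannot encode.

Specifically, it hit the minus sign character '\u2212' (U+2212), which has no representation in cp950, so a UnicodeEncodeError was raised.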
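
The failure can be reproduced without any PDF at all. The snippet below is a small sketch using only the standard library; the variable names are illustrative.

```python
import unicodedata

ch = "\u2212"
print(unicodedata.name(ch))  # official Unicode name of the character

try:
    ch.encode("cp950")
    encodable = True
except UnicodeEncodeError as exc:
    # Same error class as in the traceback above
    encodable = False
    print(exc)

# The same character encodes without trouble as UTF-8.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)
```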

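Possible solutions:

Write the file in a different encoding, such as 'utf-8' (this form writes bytes, so location must be opened in binary mode, 'wb'):

# Option 1
location.write(extract_text.encode('utf-8'))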

Replace the character that cannot be encoded with one that can:

# Option 2
extract_text = extract_text.replace('\u2212', '-')
location.write(extract_text)

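Ignore the characters that cannot be encoded (this form also writes bytes, so location must be opened in binary mode, 'wb'):

# Option 3
location.write(extract_text.encode('cp950', errors='ignore'))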

Convert the text to a cp950-compatible string first, then write it:

# Option 4
extract_text = extract_text.encode('cp950', errors='ignore').decode('cp950')
location.write(extract_text)

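The root cause is that 'cp950' covers a limited range and cannot represent every Unicode character in the text.

You can either switch to an encoding that covers all of Unicode, such as utf-8, or strip out the characters that cannot be encoded.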
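
To compare the approaches side by side, here is a small sketch on a made-up sample string. The sample text and the temporary file path are assumptions for illustration only.

```python
import os
import tempfile

sample = "x \u2212 y = 3"  # contains U+2212 MINUS SIGN

# Option 1 idea: keep every character by writing the file as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Option 2 idea: swap the character for an ASCII hyphen-minus first.
replaced = sample.replace("\u2212", "-")

# Options 3/4 idea: silently drop whatever cp950 cannot encode.
dropped = sample.encode("cp950", errors="ignore").decode("cp950")

print(round_trip == sample)
print(replaced)
print(dropped)
```

Writing as UTF-8 is the only variant here that preserves the original character; the other two trade fidelity for compatibility with cp950.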

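Conclusion: in testing, option #3 worked, and the script ran successfully on the first try! (The full code below uses the encode-then-decode form from option #4, because it writes to a file opened in text mode.)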

The complete PDF-to-TXT code:

import PyPDF2

pdf_path = "D://0 All_in_one//Alevel 教程//Game Theory An Introduction.pdf"
outfile = "D://0 All_in_one//Alevel 教程//Game Theory An Introduction.txt"

# Open the PDF in binary mode and read it with the PyPDF2 3.x API
with open(pdf_path, 'rb') as read_pdf:
    pdfload = PyPDF2.PdfReader(read_pdf)

    num_pages = len(pdfload.pages)  # replaces the old numPages attribute
    print(num_pages)

    # Initialize the text string variable
    text = ""

    # Loop over every page; pages are indexed from 0
    for pageNum in range(num_pages):
        pageObj = pdfload.pages[pageNum]
        text += pageObj.extract_text()

# Save the text result, dropping any characters cp950 cannot encode.
# This assumes the system default encoding is cp950, as in the error above.
with open(outfile, 'w') as f:
    f.write(text.encode('cp950', errors='ignore').decode('cp950'))