Python 快速合并PDF表格转换输出CSV文件-Toy模板网

这篇具有很好参考价值的文章主要介绍了Python 快速合并PDF表格转换输出CSV文件。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

Python 快速合并PDF表格转换输出CSV文件,Python,python 单位的刷脸考勤机后台系统做得比较差，只能导出每个部门的出勤统计表pdf，格式如下：

Python 快速合并PDF表格转换输出CSV文件,Python,python

近期领导要看所有部门的考勤数据，于是动手快速写了个合并pdf并输出csv文件的脚本。

安装模块

pypdf2，pdfplumber，前者用于合并，后者用于读表格。

C:\>pip install pypdf2
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pypdf2
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/5e/c86a5643653825d3c913719e788e41386bee415c2b87b4f955432f2de6b2/pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1

C:\>pip install pdfplumber
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pdfplumber
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f8/d3/f58c2d5d86a585e438c6708f568eca79e7c4e6ee3d5210cf8b31d38cb021/pdfplumber-0.10.3-py3-none-any.whl (48 kB)
Requirement already satisfied: pdfminer.six==20221105 in d:\program files\python\lib\site-packages (from pdfplumber) (20221105)
Requirement already satisfied: Pillow>=9.1 in d:\program files\python\lib\site-packages (from pdfplumber) (10.2.0)
Requirement already satisfied: pypdfium2>=4.18.0 in d:\program files\python\lib\site-packages (from pdfplumber) (4.25.0)
Requirement already satisfied: charset-normalizer>=2.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (41.0.7)
Requirement already satisfied: cffi>=1.12 in d:\program files\python\lib\site-packages (from cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (1.16.0)
Requirement already satisfied: pycparser in d:\program files\python\lib\site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (2.21)
Installing collected packages: pdfplumber
Successfully installed pdfplumber-0.10.3

读取、合并文件

PyPDF2

PyPDF2 用于对PDF文件的分离、合并、裁剪、转换、加密、解密等操作。

读取和合并pdf文件正好以前写过，主要代码如下：

with codecs.open(file_path, 'rb', encoding='utf-16') as file:
pdf_reader = PyPDF2.PdfReader(file)
text = ''
for page_num in range(len(pdf_reader.pages)):
tt = pdf_reader.pages[page_num].extract_text()
print(tt)
text += tt
......

pdfMerge = PyPDF2.PdfMerger()
try:
for pdf in pdfLists:
pdfMerge.append(pdf, import_outline=False)
pdfMerge.write(pdfFileN)
pdfMerge.close
print("PDF files merged successfully!")

......

表格读取

pdfplumber

pdfplumber 用于读取PDF文件文本和表格提取，功能比较均衡。

优点：

每页单独对象，支持文本、表格数据的抽取（亮点）
文本抽取：保留了文本的格式，比如换行位置有空格，可以通过这个特点将一段的文本整合
表格数据抽取：不会被换行数据所干扰

缺点:

进行文本抽取时，如果一页有文本和表格，那么抽取的文本数据也会包括表格数据
对于有合并单元格的表格，无法还原表格结构
表格数据不能100%保证和原数据一致，可能缺少几个字，可能识别出错等
对于无边框的表格，处理效果很差
流程图和时序图会对处理产生严重影响

读取代码如下：

pdf = pdfplumber.open(pdfFileN)
for page in pdf.pages:
tables = page.extract_tables(table_settings = {})
for table in tables:
print(table)

遍历得到的是一个个二维列表，可以根据需要自己清洗数据。

程序界面

easygui

就用这个库，弄2个对话框简单了事：

Python 快速合并PDF表格转换输出CSV文件,Python,python

选择一个文件夹，把其下的所有pdf文件合并，然后转换输出csv文件：

Python 快速合并PDF表格转换输出CSV文件,Python,python

输出的文件格式如下：

Python 快速合并PDF表格转换输出CSV文件,Python,python

更多easygui内容请见：

Python 简易图形界面库easygui 对话框大全-CSDN博客文章浏览阅读4.2k次，点赞117次，收藏96次。提供了“继续”和“取消”选项，并返回True（表示继续）或False（表示取消）。", title="结束", ok_button="干得好！easygui.ccbox(msg, title, choices=('退出[E]','取消[C]'))选择“Chocolate”后点OK就把所选择的项赋值给变量choice，点Cancel则返回None。如果选择了第一个按钮，则返回“True”。提供了Yes和No的选择，并返回“True”或“False”。在列表框中提供了可供选择的由元组或列表指定的选项列表。https://blog.csdn.net/boysoft2002/article/details/135179267Python 简易图形界面库easygui 对话框大全(续)-CSDN博客文章浏览阅读1.2k次，点赞67次，收藏58次。Python 简易图形界面库easygui 对话框大全-CSDN博客提供了“继续”和“取消”选项，并返回True（表示继续）或False（表示取消）。", title="结束", ok_button="干得好！easygui.ccbox(msg, title, choices=('退出[E]','取消[C]'))选择“Chocolate”后点OK就把所选择的项赋值给变量choice，点Cancel则返回None。如果选择了第一个按钮，则返回“True”。https://blog.csdn.net/boysoft2002/article/details/135297373以上几样库拼凑在一起，就可以完成合并和转换pdf表格，完整代码如下：

import sys,os
import datetime as dt
import PyPDF2,pdfplumber
import easygui as eg

def get_pdf_text(file_path):
    with codecs.open(file_path, 'rb', encoding='utf-16') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(pdf_reader.pages)):
            tt = pdf_reader.pages[page_num].extract_text()
            print(tt)
            text += tt
    return text

def strDateTime(diff=0):
    now = dt.datetime.now()
    future_time = now + dt.timedelta(days=diff)    
    return f'{future_time.year:04}{future_time.month:02}{future_time.day:02}_{future_time.hour:02}{future_time.minute:02}{future_time.second:02}'

txtStart = "PDFmerged_"
try:
    Dir = eg.diropenbox(msg=None, title=None, default='./')
    pdfLists = [f for f in os.listdir(Dir) if f.endswith('.pdf') and not f.startswith(txtStart)]
    pdfFileN = Dir + '\\' + txtStart + strDateTime() + ".pdf"
except:
    print('取消退出！')
    sys.exit(0)

if len(pdfLists)==0:
    eg.msgbox("此文件夹没有Pdf文件！", title="结束", ok_button="Fail")
    sys.exit(0)
else:
    pdfMerge = PyPDF2.PdfMerger()
    try:
        for pdf in pdfLists:
            pdfMerge.append(pdf, import_outline=False)
        pdfMerge.write(pdfFileN)
        pdfMerge.close
        print("PDF files merged successfully!")
    except:
        eg.msgbox("合并pdf失败！", title="结束", ok_button="Fail")
        sys.exit(0)

pdf =  pdfplumber.open(pdfFileN)
dct = dict()
for page in pdf.pages:
    tables = page.extract_tables(table_settings = {})
    for table in tables:
        for lst in table:
            tmp = lst[1:]
            tmp = [tmp[0]]+tmp[3:8]+[tmp[-1]]
            try:
                tmp[0] = tmp[0].replace('\n','')
                tmp[0] = tmp[0].split('/')
                tmp[0] = tmp[0][-1]
            except:
                pass
            if lst[0]=='时间':
                dct[lst[0]] = tmp[0]
            else:
                dct[','.join([lst[0],tmp[0] if tmp[0] else ''])] = ','.join(tmp[1:]) if all(tmp[1:]) else ''
pdf.close()
try:os.remove(pdfFileN)
except:pass
try:
    fn = "考勤表(" + dct['时间'] + ")"+strDateTime()+".csv"
except:
    fn = "考勤表"+strDateTime()+".csv"
try:
    with open(fn, 'w') as f:
        for k,v in dct.items():
            print(','.join([k,v]), file=f)
    eg.msgbox(f"考勤表保存成功！\n\n\n\t文件名：{fn}", title="结束", ok_button="Good！")
    print(f"CSV file written successfully! by HannYang {strDateTime()}")
except:
    eg.msgbox("保存csv文件失败！", title="结束", ok_button="Fail")

后话

如果要直接输出Excel表格，则需要另安装和导入xlwt模块。实现的代码大致如下：

myxl = xlwt.Workbook()
style = xlwt.easyxf('align: wrap yes; align: horiz center; font: bold yes;borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;')
sheet = myxl.add_sheet('考勤表')
wcol = [20,40,50,75,40,75]
for i in range(6):
sheet.col(i).width = wcol[i]*80
sheet.write_merge(0,0,0,8,'出勤统计报表',style)
style = xlwt.easyxf('borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;')
sheet.write_merge(1,1,0,1,'单位（盖章）：',style)
sheet.write_merge(2,2,0,1,'*经办人：',style)
sheet.write(1,3,'填表日期：',style)
sheet.write_merge(1,1,4,8,strToday(),style)
sheet.write(2,3,'*联系电话：',style)
sheet.write(2,2,adminName,style)
sheet.write_merge(2,2,4,8,adminMobil,style)
for i,t in enumerate(head.strip().split(',')):
sheet.write(3,i,t,style)
with open('考勤表.csv', 'r') as f:
for i,row in enumerate(csv.reader(f)):
if i==0:continue
for j,col in enumerate(row):
sheet.write(3+i,j,col,style)
excelfile = 'Output_'+strDateTime()+'('+defaultValue+').xls'
myxl.save(excelfile)

另外不赶时间的话，还可以用PySimpleGUI库写个带漂亮gui界面的程序，具体操作方法请参见：

探索PySimpleGUI：一款简洁易用的图形用户界面库-CSDN博客文章浏览阅读1.9k次，点赞105次，收藏88次。PySimpleGUI是一个基于Tkinter、WxPython、Qt等底层库构建的图形界面框架，其设计目标是使Python GUI编程变得更加简单直观，大大降低了入门门槛。无论是初学者还是经验丰富的开发者，都可以快速上手并高效地创建出功能丰富、外观现代的桌面应用程序。PySimpleGUI的核心优势在于其高度抽象化的API设计，它提供了包括按钮、输入框、列表框、滑块等各种常见的GUI元素。除了基本的布局和样式设置，PySimpleGUI还支持事件驱动的编程模型。https://blog.csdn.net/boysoft2002/article/details/135315323

安装模块

读取、合并文件

PyPDF2

表格读取

pdfplumber

程序界面

easygui

后话文章来源地址https://www.toymoban.com/news/detail-777775.html

到了这里，关于Python 快速合并PDF表格转换输出CSV文件的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！