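While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.
So many choices...
But there is an easy choice if you are comfortable with HTML.
Enter WeasyPrint. It takes HTML and CSS, and converts them to a usable and potentially beautiful PDF document.
The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.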
github.com/bowmanjd/pyweasyprintdemo
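Installation
To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.
Then, installation is as simple as running something like the following in an activated virtual environment:
pip install weasyprint
Alternatives to the above, depending on your tooling:
poetry add weasyprint
conda install -c conda-forge weasyprint
pipenv install weasyprint
You get the idea.
If you only want the weasyprint command-line tool, you could even use pipx and install with pipx install weasyprint. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need.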
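After installing, a quick check from Python can confirm that the package is visible in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_version is my own invention, not part of WeasyPrint.

```python
from importlib import metadata


def installed_version(package="weasyprint"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("weasyprint is not installed in this environment")
    else:
        print(f"weasyprint {version} is installed")
```

Note that a pipx install lives in its own isolated environment, so this check would most likely report it as missing even though the weasyprint command works.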
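A command line tool (Python usage optional)
Once installed, the weasyprint command line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following:
weasyprint \
  "https://en.wikipedia.org/wiki/Python_(programming_language)" \
  python.pdf
The above command will save a file python.pdf in the current working directory, converted from the HTML of the Python programming language article in English on Wikipedia. It isn't perfect, but hopefully it gives you an idea.
You don't have to specify a web address, of course. Local HTML files work fine, and they provide the necessary control over content and styling.
weasyprint sample.html out/sample.pdf
Feel free to download a sample.html and an associated sample.css stylesheet with the contents of this article.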
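If you have a whole folder of HTML files, the same command can be driven from a short script. This is only a sketch under my own assumptions: the function names and the choice to mirror each page.html as page.pdf are hypothetical, and it relies solely on the weasyprint INPUT OUTPUT invocation shown above.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(src_dir, out_dir):
    """Build one weasyprint command per .html file found in src_dir."""
    out = Path(out_dir)
    return [
        ["weasyprint", str(page), str(out / page.with_suffix(".pdf").name)]
        for page in sorted(Path(src_dir).glob("*.html"))
    ]


def convert_all(src_dir, out_dir):
    """Run the commands; requires the weasyprint executable on PATH."""
    if shutil.which("weasyprint") is None:
        raise RuntimeError("weasyprint command not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(src_dir, out_dir):
        # check=True raises CalledProcessError if a conversion fails
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the execution makes it easy to print the commands first and inspect them before anything is converted.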
CSS
body {
  font-family: sans-serif;
}
code {
  font-family: monospace;
  background: #ccc;
  padding: 2px;
}
pre code {
  display: block;
}
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
  width: 90%;
}
@media print {
  a::after {
    content: " (" attr(href) ") ";
  }
  pre {
    white-space: pre-wrap;
  }
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {
      content: counter(page);
    }
  }
  @page :first {
    @top-right {
      content: "";
    }
  }
}
HTML
<!DOCTYPE html> <html> <head> <title>PDF Generation with Python and WeasyPrint</title> <link href="sample.css" rel="stylesheet" /> </head> <body> <img src="https://dev-to-uploads.s3.amazonaws.com/i/03go0ipro79sbt8ir7oq.png" alt="Python and PDF" /> <h1>Python PDF Generation from HTML with WeasyPrint</h1> <p> While there are numerous ways to handle PDF documents with <a href="https://python.org">Python</a>, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable <a href="https://pypi.org/project/reportlab/">ReportLab</a>, and if HTML is not your cup of tea, I encourage you to look into that option. There is also <a href="https://mstamy2.github.io/PyPDF2/">PyPDF2</a>. Or maybe <a href="https://github.com/sfneal/PyPDF3">PyPDF3</a>? No, perhaps <a href="https://github.com/claird/PyPDF4">PyPDF4</a>! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth. </p> <p>So many choices...</p> <p> <img src="https://dev-to-uploads.s3.amazonaws.com/i/omcprzuh7n6u0nyzshqv.png" alt="So many choices in the cereal aisle" /> </p> <p>But there is an easy choice if you are comfortable with HTML.</p> <p> Enter <a href="https://weasyprint.org/">WeasyPrint</a>. It takes HTML and CSS, and converts it to a usable and potentially beautiful PDF document. </p> <blockquote> <p> The code samples in this article can be accessed in <a href="https://github.com/bowmanjd/pyweasyprintdemo" >the associated Github repo</a >. Feel free to clone and adapt. </p> </blockquote> <h2>Installation</h2> <p> To install <a href="https://weasyprint.org/">WeasyPrint</a>, I recommend you first <a href="https://dev.to/bowmanjd/python-tools-for-managing-virtual-environments-3bko" >set up a virtual environment with the tool of your choice</a >. 
</p> <p> Then, installation is as simple as performing something like the following in an activated virtual environment: </p> <pre><code class="language-console">pip install weasyprint </code></pre> <p>Alternatives to the above, depending on your tooling:</p> <ul> <li><code>poetry add weasyprint</code></li> <li><code>conda install -c conda-forge weasyprint</code></li> <li><code>pipenv install weasyprint</code></li> </ul> <p>You get the idea.</p> <p> If you only want the <code>weasyprint</code> command-line tool, you could even <a href="https://dev.to/bowmanjd/how-do-i-install-a-python-command-line-tool-or-script-hint-pipx-3i2" >use pipx</a > and install with <code>pipx install weasyprint</code>. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need. </p> <h2>A command line tool (Python usage optional)</h2> <p> Once installed, the <code>weasyprint</code> command line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following: </p> <pre><code class="language-console">weasyprint \ "https://en.wikipedia.org/wiki/Python_(programming_language)" \ python.pdf </code></pre> <p> The above command will save a file <code>python.pdf</code> in the current working directory, converted from the HTML from the <a href="https://en.wikipedia.org/wiki/Python_(programming_language)" >Python programming language article in English on Wikipedia</a >. It ain't perfect, but it gives you an idea, hopefully. </p> <p> You don't have to specify a web address, of course. Local HTML files work fine, and they provide necessary control over content and styling. 
</p> <pre><code class="language-console">weasyprint sample.html out/sample.pdf </code></pre> <p> Feel free to <a href="https://raw.githubusercontent.com/bowmanjd/pyweasyprintdemo/main/sample.html" >download a <code>sample.html</code></a > and an associated <a href="https://raw.githubusercontent.com/bowmanjd/pyweasyprintdemo/main/sample.css" ><code>sample.css</code> stylesheet</a > with the contents of this article. </p> <p> See <a href="https://weasyprint.readthedocs.io/en/latest/tutorial.html#as-a-standalone-program" >the WeasyPrint docs</a > for further examples and instructions regarding the standalone <code>weasyprint</code> command line tool. </p> <h2>Utilizing WeasyPrint as a Python library</h2> <p> The <a href="https://weasyprint.readthedocs.io/">Python API for WeasyPrint</a> is quite versatile. It can be used to load HTML when passed appropriate file pointers, file names, or the text of the HTML itself. </p> <p> Here is an example of a simple <code>makepdf()</code> function that accepts an HTML string, and returns the binary PDF data. </p> <pre><code class="language-python">from weasyprint import HTML def makepdf(html): """Generate a PDF file from a string of HTML.""" htmldoc = HTML(string=html, base_url="") return htmldoc.write_pdf() </code></pre> <p> The main workhorse here is the <code>HTML</code> class. When instantiating it, I found I needed to pass a <code>base_url</code> parameter in order for it to load images and other assets from relative urls, as in <code><img src="somefile.png"></code>. </p> <p> Using <code>HTML</code> and <code>write_pdf()</code>, not only will the HTML be parsed, but associated CSS, whether it is embedded in the head of the HTML (in a <code><style></code> tag), or included in a stylesheet (with a <code ><link href="sample.css" rel="stylesheet"\></code > tag). </p> <p> I should note that <code>HTML</code> can load straight from files, and <code>write_pdf()</code> can write to a file, by specifying filenames or file pointers. 
See <a href="https://weasyprint.readthedocs.io/">the docs</a> for more detail. </p> <p> Here is a more full-fledged example of the above, with primitive command line handling capability added: </p> <pre><code class="language-python">from pathlib import Path import sys from weasyprint import HTML def makepdf(html): """Generate a PDF file from a string of HTML.""" htmldoc = HTML(string=html, base_url="") return htmldoc.write_pdf() def run(): """Command runner.""" infile = sys.argv[1] outfile = sys.argv[2] html = Path(infile).read_text() pdf = makepdf(html) Path(outfile).write_bytes(pdf) if __name__ == "__main__": run() </code></pre> <p> You may <a href="https://raw.githubusercontent.com/bowmanjd/pyweasyprintdemo/main/weasyprintdemo.py" >download the above file</a > directly, or <a href="https://github.com/bowmanjd/pyweasyprintdemo" >browse the Github repo</a >. </p> <blockquote> <p> A note about Python types: the <code>string</code> parameter when instantiating <code>HTML</code> is a normal (Unicode) <code>str</code>, but <code>makepdf()</code> outputs <code>bytes</code>. </p> </blockquote> <p> Assuming the above file is in your working directory as <code>weasyprintdemo.py</code> and that a <code>sample.html</code> and an <code>out</code> directory are also there, the following should work well: </p> <pre><code class="language-console">python weasyprintdemo.py sample.html out/sample.pdf </code></pre> <p> Try it out, then open <code>out/sample.pdf</code> with your PDF reader. Are we close? </p> <h2>Styling HTML for print</h2> <p> As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing. 
</p> <p>Some useful CSS print resources:</p> <ul> <li> <a href="https://css-tricks.com/tag/print-stylesheet/" >Various articles on CSS-Tricks</a > </li> <li> <a href="https://flaviocopes.com/css-printing/#print-css" >A nice summary on flaviocopes</a > </li> <li> <a href="https://developer.mozilla.org/en-US/docs/Web/Guide/Printing" >The MDN web docs</a > </li> </ul> <p>This simple stylesheet demonstrates a few basic tricks:</p> <pre><code class="language-css">body { font-family: sans-serif; } @media print { a::after { content: " (" attr(href) ") "; } pre { white-space: pre-wrap; } @page { margin: 0.75in; size: Letter; @top-right { content: counter(page); } } @page :first { @top-right { content: ""; } } } </code></pre> <p> First, use <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Media_Queries/Using_media_queries" >media queries</a >. This allows you to use the same stylesheet for both print and screen, using <code>@media print</code> and <code>@media screen</code> respectively. In the example stylesheet, I assume that the defaults (such as seen in the <code>body</code> declaration) apply to all formats, and that <code>@media print</code> provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the <code>media</code> attribute of the <code><link></code> tag, as in <code ><link rel="stylesheet" src="print.css" media="print" /></code >. </p> <p> Second, <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/@page" >use <code>@page</code> CSS rules</a >. While <a href="https://caniuse.com/mdn-css_at-rules_page_size" >browser support</a > is pretty abysmal in 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering, in which we first define a counter in the top-right, then override with <code>:first</code> to make it blank on the first page only. In other words, page numbers only show from page 2 onward. 
</p> <p> Also note the <code>a::after</code> trick to explicitly display the <code>href</code> attribute when printing. This is either clever or annoying, depending on your goals. </p> <p> Another hint, not demonstrated above: within the <code>@media print</code> block, set <code>display: none</code> on any elements that don't need to be printed, and set <code>background: none</code> where you don't want backgrounds printed. </p> <h2>Django and Flask support</h2> <p> If you write <a href="https://www.djangoproject.com/">Django</a> or <a href="https://flask.palletsprojects.com/">Flask</a> apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks: </p> <ul> <li> <a href="https://github.com/fdemmer/django-weasyprint" >django-weasyprint</a > provides a <code>WeasyTemplateView</code> view base class or a <code>WeasyTemplateResponseMixin</code> mixin on a TemplateView </li> <li> <a href="https://pythonhosted.org/Flask-WeasyPrint/" >Flask-WeasyPrint</a > provides a special <code>HTML</code> class that works just like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a <code>render_pdf</code> function that can be called on a template or on the <code>url_for()</code> of another view, setting the correct mimetype. </li> </ul> <h2>Generate HTML the way you like</h2> <p> WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library. </p> <p><em>How</em> you generate HTML is entirely up to you. You might:</p> <ul> <li> Write HTML from scratch, and use <a href="https://jinja.palletsprojects.com/">Jinja templates</a> for variables and logic. 
</li> <li> Write Markdown and convert it to HTML with <a href="https://github.com/theacodes/cmarkgfm">cmarkgfm</a> or <a href="https://dev.to/bowmanjd/processing-markdown-in-python-using-available-commonmark-implementations-cmarkgfm-paka-cmark-and-mistletoe-350a" >other Commonmark implementation</a >. </li> <li> Generate HTML Pythonically, with <a href="https://github.com/Knio/dominate/">Dominate</a> or <a href="https://lxml.de/tutorial.html#the-e-factory" >lxml's E factory</a > </li> <li> Parse, modify, and prettify your HTML (or HTML written by others) with <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" >BeautifulSoup</a > </li> </ul> <p>Then generate the PDF using WeasyPrint.</p> <p>Anything I missed? Feel free to leave comments!</p> </body> </html>
See the WeasyPrint docs (https://weasyprint.readthedocs.io/en/latest/tutorial.html#as-a-standalone-program) for further examples and instructions regarding the standalone weasyprint command line tool.
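Using WeasyPrint as a Python library
The Python API for WeasyPrint is quite versatile. It can load HTML when passed appropriate file pointers, file names, or the text of the HTML itself.
Here is an example of a simple makepdf() function that accepts an HTML string and returns the binary PDF data.
from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()
The main workhorse here is the HTML class. When instantiating it, I found I needed to pass a base_url parameter in order for it to load images and other assets from relative URLs, as in <img src="somefile.png">.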
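An empty base_url resolves relative URLs against the current working directory in practice, which only works if you run the script from the folder that holds the assets. One alternative, sketched here under my own assumptions, is to derive base_url from the location of the HTML file itself. The helper names base_url_for and makepdf_from_file are hypothetical, and reading the file as UTF-8 is an assumption about your input.

```python
from pathlib import Path
from urllib.parse import urljoin


def base_url_for(html_path):
    """Return a file:// URL for the directory that contains html_path."""
    uri = Path(html_path).resolve().parent.as_uri()
    # The trailing slash matters: without it, URL joining would replace
    # the last path segment instead of appending to it.
    return uri if uri.endswith("/") else uri + "/"


def makepdf_from_file(html_path):
    """Render html_path to PDF bytes, resolving assets next to the file."""
    from weasyprint import HTML  # deferred so base_url_for works without WeasyPrint

    html = Path(html_path).read_text(encoding="utf-8")  # assumes UTF-8 input
    return HTML(string=html, base_url=base_url_for(html_path)).write_pdf()


if __name__ == "__main__":
    # Shows where a relative asset such as somefile.png would be looked up
    print(urljoin(base_url_for("sample.html"), "somefile.png"))
```

With this approach, <img src="somefile.png"> is looked up beside the HTML file regardless of where the script is started.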
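With HTML and write_pdf(), not only is the HTML parsed, but so is the associated CSS, whether it is embedded in the head of the HTML (in a <style> tag) or included in a stylesheet (with a <link href="sample.css" rel="stylesheet"> tag).
I should note that HTML can load straight from files, and write_pdf() can write to a file, by specifying filenames or file pointers. See the docs (https://weasyprint.readthedocs.io/) for more detail.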
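As a rough illustration of the file-to-file route, the sketch below passes filename= to HTML and a target path to write_pdf(). The helper names and the out/ naming convention are my own assumptions, not something the library prescribes.

```python
from pathlib import Path


def default_pdf_path(html_path, out_dir="out"):
    """Map e.g. sample.html to out/sample.pdf (a naming choice of this sketch)."""
    return Path(out_dir) / Path(html_path).with_suffix(".pdf").name


def convert_file(html_path, pdf_path=None):
    """Convert an HTML file on disk to a PDF file on disk."""
    from weasyprint import HTML  # deferred so the path helper needs no WeasyPrint

    target = Path(pdf_path) if pdf_path else default_pdf_path(html_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # filename= loads the document from disk; giving write_pdf() a target
    # writes the file directly instead of returning bytes.
    HTML(filename=str(html_path)).write_pdf(str(target))
    return target
```

Calling convert_file("sample.html") should then leave the result in out/sample.pdf, assuming WeasyPrint is installed and sample.html exists.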
Here is a more full-fledged example of the above, with primitive command line handling capability added:
"""Generate PDF from HTML."""

from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]   # path of the HTML file to read
    outfile = sys.argv[2]  # path of the PDF file to write
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()
You may download the above file directly, or browse the GitHub repo (https://github.com/bowmanjd/pyweasyprintdemo).
A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.
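Because the output is bytes, it has to be written in binary mode, and it is easy to sanity-check: PDF data begins with the marker %PDF-. The helpers below are a hypothetical sketch of such a check, not part of WeasyPrint.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data):
    """Return True if data is bytes that start with the PDF header marker."""
    return isinstance(data, (bytes, bytearray)) and bytes(data[:5]) == PDF_MAGIC


def save_pdf(data, path):
    """Write PDF bytes in binary mode, refusing anything that is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("expected PDF bytes, e.g. the return value of makepdf()")
    with open(path, "wb") as handle:
        handle.write(data)
```

Passing a str by mistake (for example the HTML itself) raises a ValueError instead of silently producing a broken file.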
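Assuming the above file is in your working directory as weasyprintdemo.py, and that a sample.html and an out directory are also there, the following should work well:
python weasyprintdemo.py sample.html out/sample.pdf
Try it out, then open out/sample.pdf with your PDF reader. Are we close?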
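Styling HTML for print
As is probably apparent, using WeasyPrint is easy. The real work in HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.
Some useful CSS print resources:
Various articles on CSS-Tricks: https://css-tricks.com/tag/print-stylesheet/
A nice summary on flaviocopes: https://flaviocopes.com/css-printing/#print-css
The MDN web docs: https://developer.mozilla.org/en-US/docs/Web/Guide/Printing
This simple stylesheet demonstrates a few basic tricks:
body {
  font-family: sans-serif;
}
@media print {
  a::after {
    content: " (" attr(href) ") ";
  }
  pre {
    white-space: pre-wrap;
  }
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {
      content: counter(page);
    }
  }
  @page :first {
    @top-right {
      content: "";
    }
  }
}
First, use media queries. This allows you to use the same stylesheet for both print and screen, using @media print and @media screen respectively. In the example stylesheet, I assume that the defaults (such as those in the body declaration) apply to all formats, and that @media print provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the media attribute of the <link> tag, as in <link rel="stylesheet" href="print.css" media="print" />.
Second, use @page CSS rules. While browser support was pretty abysmal as of 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering, in which we first define a counter in the top-right, then override it with :first to make it blank on the first page only. In other words, page numbers only show from page 2 onward.
Also note the a::after trick to explicitly display the href attribute when printing. This is either clever or annoying, depending on your goals.
Another hint, not demonstrated above: within the @media print block, set display: none on any elements that don't need to be printed, and set background: none where you don't want backgrounds printed.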
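If you cannot or would rather not edit the page's own stylesheet, such print-only overrides can also be supplied from Python through the stylesheets option of write_pdf(). This is a minimal sketch: the nav and .no-print selectors are placeholders I made up, so substitute whatever your markup actually uses.

```python
# Hypothetical print-only overrides; adjust the selectors to your own markup.
PRINT_OVERRIDES = """
@media print {
  nav, .no-print { display: none; }
  body, pre, code { background: none; }
}
"""


def makepdf_with_overrides(html, extra_css=PRINT_OVERRIDES):
    """Render HTML to PDF bytes with an additional stylesheet applied."""
    from weasyprint import CSS, HTML  # deferred import; needs WeasyPrint installed

    document = HTML(string=html, base_url="")
    # Stylesheets passed here are applied on top of the document's own CSS.
    return document.write_pdf(stylesheets=[CSS(string=extra_css)])
```

This keeps the HTML untouched, which can be handy when the pages come from somewhere you do not control.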
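Django and Flask support
If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:
django-weasyprint provides a WeasyTemplateView view base class, or a WeasyTemplateResponseMixin mixin for use on a TemplateView.
Flask-WeasyPrint provides a special HTML class that works just like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a render_pdf function that can be called on a template or on the url_for() of another view, setting the correct mimetype.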
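To give a feel for what "setting the correct mimetype" amounts to, here is a framework-neutral sketch using only plain WSGI conventions. It is not the API of django-weasyprint or Flask-WeasyPrint; the function names are hypothetical, and the payload is a placeholder standing in for real PDF bytes.

```python
def pdf_response_headers(pdf_bytes, filename="document.pdf", inline=True):
    """Build HTTP headers for sending PDF bytes from a web view."""
    disposition = "inline" if inline else "attachment"
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(pdf_bytes))),
        ("Content-Disposition", f'{disposition}; filename="{filename}"'),
    ]


def pdf_wsgi_app(environ, start_response):
    """Tiny WSGI app that serves a placeholder payload with PDF headers."""
    payload = b"%PDF-placeholder"  # in real use: the bytes returned by makepdf()
    start_response("200 OK", pdf_response_headers(payload, "sample.pdf"))
    return [payload]
```

In a real Django or Flask project, the libraries listed above take care of these details for you, so treat this only as an explanation of what happens underneath.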
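Generate HTML the way you like
WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.
How you generate HTML is entirely up to you. You might:
Write HTML from scratch, and use Jinja templates for variables and logic.
Write Markdown and convert it to HTML with cmarkgfm or another CommonMark implementation.
Generate HTML Pythonically, with Dominate or lxml's E factory.
Parse, modify, and prettify your HTML (or HTML written by others) with BeautifulSoup.
Then generate the PDF using WeasyPrint.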
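As a small illustration of the first option, here is a sketch that fills an HTML template with values. To stay dependency-free it uses the standard library's string.Template as a stand-in for Jinja, which offers far more (loops, conditionals, autoescaping); the template and the render_page helper are made up for this example.

```python
from html import escape
from string import Template

# A deliberately tiny page template; $title and $items are placeholders.
PAGE = Template(
    "<!DOCTYPE html><html><head><title>$title</title></head>"
    "<body><h1>$title</h1><ul>$items</ul></body></html>"
)


def render_page(title, items):
    """Fill the template, escaping values so they cannot break the markup."""
    rows = "".join(f"<li>{escape(item)}</li>" for item in items)
    return PAGE.substitute(title=escape(title), items=rows)


if __name__ == "__main__":
    print(render_page("Q3 <Report>", ["Revenue & costs", "Outlook"]))
```

The resulting string can be handed straight to the makepdf() function from earlier in this article.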
Anything I missed? Feel free to leave comments!