Working with PDF Files in Python


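Python is a very versatile language, largely because of the many libraries available for different needs. Almost everyone works with Portable Document Format (PDF) files, and Python offers several ways to handle them. Here we use the Python library PyPDF2 to work with PDF files.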

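PyPDF2 is a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well as merge entire files together.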

Because PyPDF2 can perform so many different operations on PDFs, it works like a Swiss Army knife.

Getting Started

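PyPDF2 is a third-party package rather than part of the Python standard library, so it has to be installed first. The good news is that this is easy to do with pip. Just run the following command in your terminal: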

C:\Users\rajesh>pip install pypdf2
Collecting pypdf2
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 83kB/s
Building wheels for collected packages: pypdf2
Building wheel for pypdf2 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\53\84\19\35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built pypdf2
Installing collected packages: pypdf2
Successfully installed pypdf2-1.26.0

To verify the installation, import PyPDF2 from the Python shell:

>>> import PyPDF2
>>>
If no error appears, the import succeeded and the package is installed.
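
The same check can be scripted. The sketch below is one possible way to do it, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for this example and is not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None if PyPDF2 is missing.
print(installed_version("PyPDF2"))
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to run pip.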

Extracting Metadata

We can extract some useful data from a PDF, for example information about the document's author, its title and subject, and the number of pages the file contains.

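The following Python program uses the PyPDF2 package (the 1.26.0 API installed above; newer releases rename these classes and methods) to extract this information from a PDF file.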

from PyPDF2 import PdfFileReader

def extract_pdfMeta(path):
    # Open the file in binary mode; PdfFileReader reads from the file object
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()         # document information (metadata)
        number_of_pages = pdf.getNumPages()  # total number of pages
    print("Author: \t", info.author)
    print()
    print("Creator: \t", info.creator)
    print()
    print("Producer: \t", info.producer)
    print()
    print("Subject: \t", info.subject)
    print()
    print("title: \t", info.title)
    print()
    print("Number of Pages in pdf: \t", number_of_pages)

if __name__ == '__main__':
    path = 'DeepLearning.pdf'
    extract_pdfMeta(path)

Output

Author: Nikhil Buduma,Nicholas Locascio

Creator: AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)

Producer: Antenna House PDF Output Library 6.2.609 (Linux64)

Subject: None

title: Fundamentals of Deep Learning

Number of Pages in pdf: 298

So without even opening the PDF file in a viewer, we can get some useful information from it.
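
When the values are needed by other code rather than printed, they can be gathered into a dictionary. This is a sketch under assumptions: collect_metadata is a hypothetical helper, not a PyPDF2 function, and it relies only on the getDocumentInfo() and getNumPages() methods used above. In PyPDF2 1.26.0, getDocumentInfo() appears to return None when a PDF has no /Info dictionary, so the sketch tolerates that case.

```python
def collect_metadata(reader):
    """Return selected document metadata as a plain dict.

    `reader` is expected to behave like PyPDF2.PdfFileReader (1.26.0 API).
    """
    info = reader.getDocumentInfo()  # may be None if the PDF carries no metadata
    fields = ("author", "creator", "producer", "subject", "title")
    # getattr with a default also covers info being None: every field becomes None.
    result = {name: getattr(info, name, None) for name in fields}
    result["pages"] = reader.getNumPages()
    return result


# Intended use (not run here; requires PyPDF2 1.26.0 and the sample file):
#   from PyPDF2 import PdfFileReader
#   with open('DeepLearning.pdf', 'rb') as f:
#       print(collect_metadata(PdfFileReader(f)))
```

Calling the helper inside the with block keeps the file open while the metadata is read.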

Extracting Text from a PDF

We can also extract text from a PDF. PyPDF2 1.26.0 has no built-in helper for extracting images, though.

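Let's try to extract text from one specific page of the PDF file used above, here the page at index 50. Page indexes are zero-based, so this is the 51st page of the file.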

# Import PyPDF2
from PyPDF2 import PdfFileReader

def extract_pdfText(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # getPage() is zero-based: index 50 is the 51st page of the file
        page = pdf.getPage(50)
        print(page)
        print('Page type: {}'.format(str(type(page))))
        # Extract the text of that page
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = 'DeepLearning.pdf'
    extract_pdfText(path)

Output

{'/Annots': IndirectObject(1421, 0),
'/Contents': IndirectObject(179, 0),
'/CropBox': [0, 0, 595.3, 841.9],
'/Group': {'/CS': '/DeviceRGB', '/S': '/Transparency', '/Type': '/Group'},
'/MediaBox': [0, 0, 504, 661.5],
'/Parent': IndirectObject(4863, 0),
'/Resources': IndirectObject(1423, 0),
'/Rotate': 0,
'/Type': '/Page'
}

Page type: <class 'PyPDF2.pdf.PageObject'>
time. In inverted dropout, any neuron whose activation hasn†t been silenced has its
output divided by p before the value is propagated to the next layer. With this
fix, Eoutput=p⁄xp+1ƒ
p⁄0=
x, and we can avoid arbitrarily scaling neuronal
output at test time.

SummaryIn this chapter, we†ve learned all of the basics involved in training feed-forward neural
networks. We†ve talked about gradient descent, the backpropagation algorithm, as
well as various methods we can use to prevent overfitting. In the next chapter, we†ll
put these lessons into practice when we use the TensorFlow library to efficiently
implement our first neural networks. Then in
Chapter 4

, we†ll return to the problem
of optimizing objective functions for training neural networks and design algorithmsto significantly improve performance. These improvements will enable us to process
much more data, which means we†ll be able to build more comprehensive models.
Summary | 37

Although we were able to get some text from that page, it is not very clean. Unfortunately, PyPDF2's support for extracting text from PDFs is quite limited.
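
Some of the damage in the output above can be cleaned up after extraction. The following is a minimal sketch, not a PyPDF2 feature: the function name tidy_extracted_text and the replacement table are assumptions for illustration, based only on the dagger character (†) that shows up where apostrophes should be in this sample. It also joins the lines inside each paragraph, treating blank lines as paragraph breaks.

```python
import re


def tidy_extracted_text(text, replacements=None):
    """Apply simple character fixes and re-join lines that were broken mid-paragraph."""
    if replacements is None:
        # Assumption drawn from the sample output: a dagger stands in for an apostrophe.
        replacements = {"\u2020": "'"}
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # Blank lines separate paragraphs; collapse all whitespace inside each paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = "we\u2020ve learned all of the basics\nof training.\n\nSummary | 37"
print(tidy_extracted_text(sample))
```

Other PDFs are likely to need a different replacement table, and this approach does nothing about words that were fused together during extraction, such as "algorithmsto" in the output above.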

Rotating a Specific Page of a PDF File

>>> import PyPDF2
>>> deeplearningFile = open('DeepLearning.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(deeplearningFile)
>>> page = pdfReader.getPage(0)
>>> page.rotateClockwise(90)
{
'/Contents': [IndirectObject(4870, 0), IndirectObject(4871, 0), IndirectObject(4872, 0), IndirectObject(4873, 0), IndirectObject(4874, 0), IndirectObject(4875, 0), IndirectObject(4876, 0), IndirectObject(4877, 0)],
'/CropBox': [0, 0, 595.3, 841.9],
'/MediaBox': [0, 0, 504, 661.5],
'/Parent': IndirectObject(4862, 0),
'/Resources': IndirectObject(4889, 0),
'/Rotate': 90,
'/Type': '/Page'
}
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
>>> deeplearningFile.close()

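Output

The session writes a new file, rotatedPage.pdf, containing only the first page of DeepLearning.pdf rotated 90 degrees clockwise.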
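
The session above saves only the single rotated page. To keep the whole document and rotate just some of its pages, the same calls can be put in a loop. This is a sketch under assumptions: copy_with_rotation and the output name rotatedCopy.pdf are hypothetical, and the reader and writer arguments are expected to behave like PyPDF2.PdfFileReader and PyPDF2.PdfFileWriter from version 1.26.0.

```python
def copy_with_rotation(reader, writer, rotate_indexes, angle=90):
    """Copy every page from reader to writer, rotating the chosen pages clockwise."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    wanted = set(rotate_indexes)  # zero-based page indexes to rotate
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        if index in wanted:
            page.rotateClockwise(angle)
        writer.addPage(page)
    return writer


# Intended use (not run here; requires PyPDF2 1.26.0 and the sample file):
#   import PyPDF2
#   with open('DeepLearning.pdf', 'rb') as src, open('rotatedCopy.pdf', 'wb') as dst:
#       writer = copy_with_rotation(PyPDF2.PdfFileReader(src), PyPDF2.PdfFileWriter(), [0])
#       writer.write(dst)
```

The output is written while the source file is still open, because the writer reads page content from the source when write() is called.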

Updated on: 2019-07-30
