Python - Processing PDF



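Python can read a PDF file, extract its text, and print the content. To do this, we first install the required module, PyPDF2, using the command below. pip should already be installed in your Python environment.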

pip install pypdf2
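
Before moving on, it can help to confirm that the module is importable from the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    # find_spec() returns None when no module with this name is available.
    return importlib.util.find_spec(module_name) is not None


# Prints True once the pip command above has succeeded for this interpreter.
print(is_installed("PyPDF2"))
```

If this prints False, pip most likely installed the package into a different Python environment than the one running the script.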

Once the module is installed, we can read a PDF file using the methods it provides.

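import PyPDF2

# Raw string so the backslash in the Windows-style path is kept literally
pdfName = r'path\Tutorialspoint.pdf'
# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]  # pages are zero-indexed, so this is the first page
page_content = page.extract_text()
print(page_content)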

When we run the above program, we get the following output:

Tutorials Point originated from the idea that there exists a class of readers who respond better 
to online content and prefer to learn new skills at their own pace from the comforts of their 
drawing rooms.
 
The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository which now 
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
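
The single-page read above can be wrapped in a small function that reports a missing file or an out-of-range page clearly. This is a sketch under stated assumptions, not part of PyPDF2 itself: the name extract_page_text and its page_index parameter are hypothetical, and it assumes PyPDF2 3.x, where PdfReader, pages and extract_text() are available.

```python
from pathlib import Path


def extract_page_text(pdf_path, page_index=0):
    """Return the text of one page (zero-based index) of a PDF file."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check above runs before PyPDF2 is needed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    page_count = len(reader.pages)
    if not 0 <= page_index < page_count:
        raise IndexError(f"{path} has {page_count} page(s); got index {page_index}")

    # extract_text() may return an empty string for scanned, image-only pages.
    return reader.pages[page_index].extract_text()


# Usage, with the placeholder path from the example above:
# print(extract_page_text(r'path\Tutorialspoint.pdf'))
```

Checking the path first means a mistyped file name produces a FileNotFoundError rather than an error from inside the PDF parser.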

Reading Multiple Pages

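To read a PDF with several pages and print each page separately under its page number, we use a loop and the get_page_number() method (named getPageNumber() in PyPDF2 releases before 3.0). In the example below, the PDF file has two pages, and the content of each page is printed under its own page heading.

import PyPDF2

pdfName = r'Path\Tutorialspoint2.pdf'
read_pdf = PyPDF2.PdfReader(pdfName)

for page in read_pdf.pages:
    # get_page_number() returns a zero-based index, so add 1 for display
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)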

When we run the above program, we get the following output:

Page No - 1
Tutorials Point originated from the idea that there exists a class of readers who respond better to 
online content and prefer to learn new skills at their own pace from the comforts of their drawing 
rooms. 


Page No - 2
 
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it 
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from p
rogramming languages to web 
designing to academics and much more.
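
The same page-by-page output can also be produced by collecting the text of every page first and numbering the pages with enumerate(). The sketch below assumes PyPDF2 3.x; the helper names format_pages and read_all_pages are hypothetical and chosen only for illustration.

```python
def format_pages(page_texts):
    """Join page texts under 'Page No - N' headings, numbering from 1."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"Page No - {number}")
        lines.append(text)
    return "\n".join(lines)


def read_all_pages(pdf_path):
    """Return a list with the extracted text of every page, in order."""
    # Imported here so format_pages() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Usage, with the placeholder path from the example above:
# print(format_pages(read_all_pages(r'Path\Tutorialspoint2.pdf')))
```

Keeping the formatting separate from the reading step makes it easy to reuse the page texts elsewhere, for example to search them or write them to a file.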
 