Python – 使用 OCR（光学字符识别）读取 PDF 内容

PDF 代表便携式文档格式，是可以在设备之间交换的流行文件格式之一。由于 PDF 格式的文件包含无法更改的文本，因此它使用户能够更轻松地阅读并保持文件的格式稳定。尽管阅读 PDF 格式的文本很容易，但从中复制内容可能很耗时。为了使阅读过程更容易，使用了 OCR（光学字符识别）工具。

使用 OCR 读取 PDF 内容

在本文中，我们将讨论光学字符识别或 OCR，这是一种电子工具，可帮助将扫描的图像或手写文本转换为可编辑的计算机文件。Python 支持多个第三方库，这些库利用 OCR 技术从 PDF 中读取内容。

其中一个库是 Pytesseract。它是一个用于 Python 的光学字符识别 (OCR) 引擎，它在后台使用 Google 的 Tesseract-OCR。Pytesseract 可以识别超过 100 种语言（包括英语、印地语、阿拉伯语和中文等）的 PDF 文件中的文本。

光学字符识别

OCR 技术消除了手动阅读文档的过程并节省了时间。其应用不仅限于文档提取，还扩展到手写识别、身份证识别和身份验证。总之，考虑到 OCR 在处理图像或 PDF 文档时的各种用例，它是每个开发人员都应该熟悉的工具之一。

Python 提供了与许多市售 OCR 库（如 pytesseract）高效交互所需的灵活性，使我们的项目能够通过在大型数据集上扩展它们来实现简化运行，而无需人工干预。当我们将这种能力与自然语言处理 (NLP) 和对象检测等不同的机器学习概念相结合时，我们能够将计算机的程序化渲染能力推向何种程度就没有限制了。

Learn Python in-depth with real-world projects through our Python certification course. Enroll and become a certified expert to boost your career.

使用 try 和 except 方法的 Python 程序，用于使用 OCR 读取 PDF 内容

输入以 PDF 格式给出，并命名为 sample.pdf，然后使用光学字符识别工具识别 PDF 文件中的文本，最后返回示例文本。为此，使用了 try 和 except 方法。

算法

步骤 1 - 导入所需的模块，如 os 和 pytesseract。
步骤 2 - 从 PIL 包中导入 image 模块
步骤 3 - 使用名为“convert_from_path”的函数将给定的 pdf 文件转换为图像
步骤 4 - 该函数定义了一个参数作为输入文件名。
步骤 5 - 初始化空列表
步骤 6 - try 方法将把 PDF 文件中的每个文本转换为文本。
步骤 7 - 对于 images 列表中的每个图像，为每个图像生成文件名并以 JPEG 格式保存。
步骤 8 - 使用 pytesseract 模块提取文本，然后将其添加到初始化的空列表中。
步骤 9 - 如果在执行上述步骤时出现任何异常，则打印该异常。
步骤 10 - 通过从输入文件名中删除扩展名并附加 .txt 扩展名来生成输出文件名。
步骤 11 - 将提取的文本写入输出文件并返回输出文件名。
步骤 12 - 使用输入文件名定义 pdf_file 变量。
步骤 13 - 使用 pdf_file 变量作为输入调用 read_pdf 函数并打印其输出。

示例

# Importing the os module to perform file operations
import os  
# Importing the pytesseract module to extract text from images
import pytesseract as tess  
# Importing the Image module from the PIL package to work with images
from PIL import Image  
# Importing the convert_from_path function from the pdf2image module to convert PDF files to images
from pdf2image import convert_from_path  

#This function takes a PDF file name as input and returns the name of the text file that contains the extracted text.
def read_pdf(file_name):   
    # Store all pages of one file here:
    pages = []

    try:
        # Convert the PDF file to a list of PIL images:
        images = convert_from_path(file_name)  

        # Extract text from each image:
        for i, image in enumerate(images):
          # Generating filename for each image
            filename = "page_" + str(i) + "_" + os.path.basename(file_name) + ".jpeg"  
            image.save(filename, "JPEG")  
          # Saving each image as JPEG
            text = tess.image_to_string(Image.open(filename))  # Extracting text from each image using pytesseract
            pages.append(text)  
          # Appending extracted text to pages list

    except Exception as e:
        print(str(e))

    # Write the extracted text to a file:
    output_file_name = os.path.splitext(file_name)[0] + ".txt"  # Generating output file name
    with open(output_file_name, "w") as f:
        f.write("\n".join(pages))  
      # Writing extracted text to output file

    return output_file_name

#print function returns the final converted text 
pdf_file = "sample.pdf"
print(read_pdf(pdf_file))

输入

输出

结论

在 21 世纪，对于拥有大量数据的组织来说，数据处理是最具挑战性的任务，随着数据科学和机器学习的发展，访问数据变得更加容易。最适合传输且无需任何更改的文件是 pdf，因此这种方法可以帮助人们将其转换为文本文件。

Pranavnath

更新于： 2023-09-04

4K+ 阅读量

开启你的职业生涯

通过完成课程获得认证

开始学习

Python – 使用 OCR（光学字符识别）读取 PDF 内容

使用 OCR 读取 PDF 内容

光学字符识别

使用 try 和 except 方法的 Python 程序，用于使用 OCR 读取 PDF 内容

算法

示例

输入

输出

结论

开启你的 职业生涯

开启你的职业生涯