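Python Web Scraping - Dealing with CAPTCHA

In this chapter, we look at how to perform web scraping on pages protected by a CAPTCHA, the test that websites use to check whether a user is a human or a machine.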

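What is a CAPTCHA?

CAPTCHA stands for **Completely Automated Public Turing test to tell Computers and Humans Apart**, which makes clear that it is a test for determining whether the user is human.

A CAPTCHA is typically a distorted image that is hard for a computer program to read but that a human can still make out. Many websites use CAPTCHAs to keep bots from interacting with them.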

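Loading a CAPTCHA with Python

Suppose we want to register on a website whose registration form contains a CAPTCHA. Before loading the CAPTCHA image, we need to know which fields the form requires. The following Python script lists the fields of the registration form on the website http://example.webscraping.com.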

import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
def form_parsing(html):
   # Collect the name/value pair of every named <input> inside a <form>
   tree = lxml.html.fromstring(html)
   data = {}
   for e in tree.cssselect('form input'):
      if e.get('name'):
         data[e.get('name')] = e.get('value')
   return data
REGISTER_URL = 'http://example.webscraping.com/user/register'
# Cookie-aware opener so the session persists across requests
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(
   'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'
).read()
form = form_parsing(html)
pprint.pprint(form)

The script above first defines a function that parses the form with the lxml Python module and then prints the form fields, as follows:

{
   '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
   '_formname': 'register',
   '_next': '/places/default/index',
   'email': '',
   'first_name': '',
   'last_name': '',
   'password': '',
   'password_two': '',
   'recaptcha_response_field': None
}
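To try the parsing step without network access or the lxml dependency, the following is a minimal sketch that applies the same idea to an inline HTML string using only the standard library's html.parser module. The class name FormInputCollector, the helper parse_form_inputs, and the sample markup are illustrative assumptions, not part of the original script.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs of <input> elements nested inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # greater than 0 while inside a <form> element
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.form_depth += 1
        elif tag == 'input' and self.form_depth > 0:
            attributes = dict(attrs)
            name = attributes.get('name')
            if name:
                # A missing value attribute yields None, like e.get('value')
                self.data[name] = attributes.get('value')

    def handle_endtag(self, tag):
        if tag == 'form' and self.form_depth > 0:
            self.form_depth -= 1


def parse_form_inputs(html):
    parser = FormInputCollector()
    parser.feed(html)
    return parser.data


# Hypothetical markup standing in for the real registration page
SAMPLE_HTML = """
<form>
  <input name="_formname" value="register">
  <input name="email" value="">
  <input name="recaptcha_response_field">
  <input type="submit">
</form>
<input name="outside_form" value="ignored">
"""

fields = parse_form_inputs(SAMPLE_HTML)
print(fields)
```

As with the lxml version, unnamed inputs and inputs outside a form are skipped, and a field without a value attribute is reported as None.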

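Among the form fields printed by the script, every field is straightforward except **recaptcha_response_field**. The question now is how to handle this field and download the CAPTCHA image. This can be done with the Pillow Python library, as follows: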

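Pillow Python Package

Pillow is a fork of the Python Imaging Library (PIL) and provides useful functions for manipulating images. It can be installed with the following command: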

pip install pillow

In the following example, we use it to load the CAPTCHA:

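import base64
from io import BytesIO
import lxml.html
from PIL import Image
def load_captcha(html):
   tree = lxml.html.fromstring(html)
   # The image is embedded in the page as a base64 data URI
   img_data = tree.cssselect('div#recaptcha img')[0].get('src')
   # Drop the prefix before the comma and keep only the base64 payload
   img_data = img_data.partition(',')[-1]
   # Decode the base64 payload into raw image bytes
   binary_img_data = base64.b64decode(img_data)
   file_like = BytesIO(binary_img_data)
   img = Image.open(file_like)
   return img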

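The script above uses the **pillow** Python package and defines a function for loading the CAPTCHA image. It is meant to be used together with the **form_parsing()** function defined in the previous script, which retrieves information about the registration form. The function returns the CAPTCHA as a Pillow image object, a convenient format from which the text can then be extracted as a string.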
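The key step in load_captcha() is turning the data URI from the src attribute into raw bytes. The sketch below isolates that step using only the standard library. The helper decode_data_uri and the sample payload are hypothetical and stand in for the real image data.

```python
import base64


def decode_data_uri(src):
    """Return (header, raw_bytes) for a base64-encoded data URI."""
    header, _, payload = src.partition(',')
    if not header.endswith(';base64'):
        raise ValueError('expected a base64-encoded data URI')
    return header, base64.b64decode(payload)


# Placeholder bytes: the 8-byte PNG signature, not a complete image
fake_png = b'\x89PNG\r\n\x1a\n'
src = 'data:image/png;base64,' + base64.b64encode(fake_png).decode('ascii')

header, raw = decode_data_uri(src)
print(header)
print(raw == fake_png)
```

With a real CAPTCHA, the returned bytes would be wrapped in BytesIO and passed to Image.open(), as load_captcha() does.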

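OCR: Extracting Text from an Image with Python

Once the CAPTCHA has been loaded in a usable format, its text can be extracted with Optical Character Recognition (OCR), the process of extracting text from images. For this we use the open-source Tesseract OCR engine through its Python wrapper, pytesseract. The Tesseract engine itself must be installed separately; the wrapper can be installed with the following command: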

pip install pytesseract

Example

Here we extend the Python script above, which loaded the CAPTCHA with the Pillow Python package, as follows:

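import pytesseract
# html is the registration page fetched in the first script
img = load_captcha(html)
img.save('captcha_original.png')
# Convert to 8-bit grayscale
gray = img.convert('L')
gray.save('captcha_gray.png')
# Threshold: keep only completely black pixels, turn everything else white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')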

The script above converts the CAPTCHA to black and white, which makes it cleaner and easier to pass to tesseract, as follows:

pytesseract.image_to_string(bw)
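To see what the grayscale conversion and the thresholding compute, here is a pure-Python sketch on a tiny hand-made pixel grid. It assumes the luma weights documented for Pillow's convert('L') (299, 587, and 114 per thousand); Pillow's internal rounding may differ slightly, and the function names to_gray and threshold are illustrative.

```python
def to_gray(pixel):
    """Approximate luma of an (R, G, B) pixel using the ITU-R 601-2 weights."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def threshold(value, cutoff=1):
    """Map a gray value to black (0) if below cutoff, otherwise white (255)."""
    return 0 if value < cutoff else 255


# 2x3 sample "image": pure black text pixels on a noisy background
pixels = [
    [(0, 0, 0), (40, 40, 40), (255, 255, 255)],
    [(200, 30, 30), (0, 0, 0), (12, 12, 12)],
]

gray_grid = [[to_gray(p) for p in row] for row in pixels]
bw_grid = [[threshold(v) for v in row] for row in gray_grid]
print(gray_grid)
print(bw_grid)
```

With a cutoff of 1, only pixels that are completely black survive. This works when the CAPTCHA text is rendered in pure black; for other images the cutoff would likely need tuning.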

Calling pytesseract.image_to_string() on the thresholded image returns the CAPTCHA text of the registration form as output.
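OCR output often contains stray whitespace or punctuation, and the recognized text still has to go back into the form. The following sketch shows one possible way to clean the string and prepare a POST request with the standard library. It is an assumption that the site expects lowercase letters only; the helper names clean_captcha_text and build_register_request and the sample values are hypothetical, and the request is built but not sent.

```python
import re
import urllib.parse
import urllib.request

REGISTER_URL = 'http://example.webscraping.com/user/register'


def clean_captcha_text(ocr_text):
    """Keep only ASCII letters and lowercase them (assumed CAPTCHA alphabet)."""
    return re.sub(r'[^A-Za-z]', '', ocr_text).lower()


def build_register_request(form, captcha_text):
    """Return a POST request carrying the form fields plus the CAPTCHA answer."""
    filled = dict(form)
    filled['recaptcha_response_field'] = captcha_text
    # urlencode would render None as the string 'None', so send '' instead
    filled = {k: ('' if v is None else v) for k, v in filled.items()}
    body = urllib.parse.urlencode(filled).encode('utf-8')
    return urllib.request.Request(REGISTER_URL, data=body)


# Sample values for illustration only
form = {'_formname': 'register', 'email': 'user@example.com', 'password': None}
text = clean_captcha_text(' Str-ange\n')
request = build_register_request(form, text)
print(text)
print(request.get_method())
print(request.data)
```

Passing the request to the cookie-aware browser.open() from the first script would submit the form within the same session that received the CAPTCHA.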
