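Python Web Scraping - Data Extraction

Analyzing a web page means understanding its structure. Why does that matter for web scraping? This chapter looks at the question in detail.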

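Web Page Analysis

Web page analysis matters because, without it, we cannot know in what form (structured or unstructured) the data will arrive once we extract it from the page. We can analyze a web page in the following ways: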

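Viewing Page Source

This approach reveals a page's structure by examining its source code. Right-click the page and select the **View page source** option. The data we are interested in then appears as HTML. The main drawback is whitespace and formatting: raw source is often hard to read.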

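Inspecting Page Source with the Inspect Element Option

This is another way to analyze a web page. The difference is that it avoids the formatting and whitespace problems of the raw source. Right-click the page and select **Inspect** or **Inspect element** from the menu. The browser then shows information about the specific region or element of the page.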
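
A similar structural view can be approximated in code. The following is a minimal sketch, not part of the original tutorial, that uses the standard library's html.parser module to print an indented outline of the tags in a document. The class name TagOutline, the helper outline, and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagOutline(HTMLParser):
    """Collect one indented line per opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Show id/class attributes, since selectors usually target them
        shown = " ".join(f'{k}="{v}"' for k, v in attrs if k in ("id", "class"))
        suffix = f" {shown}" if shown else ""
        self.lines.append("  " * self.depth + f"<{tag}{suffix}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def outline(markup):
    """Return the tag outline of markup as a list of indented lines."""
    parser = TagOutline()
    parser.feed(markup)
    parser.close()
    return parser.lines


# Sample markup invented for this illustration
sample = ('<html><body><table id="info"><tr>'
          '<td class="w2p_fw">India</td></tr></table>'
          '<img src="flag.png"></body></html>')
print("\n".join(outline(sample)))
```

Showing only the id and class attributes keeps the outline short while still surfacing the hooks that a regular expression or XPath would later rely on.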

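Different Ways to Extract Data from a Web Page

The following methods are mainly used to extract data from a web page:

Regular Expression

Regular expressions are a small, highly specialized language embedded in Python and made available through the **re** module. They are also called REs, regexes, or regex patterns. With a regular expression, we specify rules that describe the set of strings we want to match in the data.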
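
As a small illustration of specifying rules for a set of strings, the sketch below pulls link targets and link text out of a snippet. The sample markup and both patterns are assumptions chosen for illustration; patterns like these tend to be brittle against real-world HTML.

```python
import re

# Sample markup invented for this illustration
snippet = ('<a href="/places/default/iso/CN">CN</a> '
           '<a href="/places/default/iso/NP">NP</a>')

# href="...": capture the shortest run of characters up to the closing quote
link_pattern = re.compile(r'href="(.*?)"')
links = link_pattern.findall(snippet)
print(links)

# With two groups, findall returns one (target, text) tuple per match
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', snippet)
print(pairs)
```

The `?` after `.*` makes the match non-greedy, so each capture stops at the first closing quote or tag instead of running to the last one on the line.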

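For a broader introduction to regular expressions, see https://tutorialspoint.com/automata_theory/regular_expressions.htm. For more on the re module and regular expressions in Python, see https://tutorialspoint.com/python/python_reg_expressions.htm.

Example

In the following example, we scrape data about India from http://example.webscraping.com by using a regular expression to match the contents of <td> cells.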

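import re
import urllib.request

# Fetch the page and decode the response bytes into text
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()

# Non-greedy match of the contents of every <td class="w2p_fw"> cell
print(re.findall(r'<td class="w2p_fw">(.*?)</td>', text))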

Output

The corresponding output will be as shown below:

[
   '<img src="/places/static/images/flags/in.png" />',
   '3,287,590 square kilometres',
   '1,173,108,018',
   'IN',
   'India',
   'New Delhi',
   '<a href="/places/default/continent/AS">AS</a>',
   '.in',
   'INR',
   'Rupee',
   '91',
   '######',
   '^(\\d{6})$',
   'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
   '<div>
      <a href="/places/default/iso/CN">CN </a>
      <a href="/places/default/iso/NP">NP </a>
      <a href="/places/default/iso/MM">MM </a>
      <a href="/places/default/iso/BT">BT </a>
      <a href="/places/default/iso/PK">PK </a>
      <a href="/places/default/iso/BD">BD </a>
   </div>'
]

Note that the output above contains the details about India, extracted with a regular expression.
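
Several values in the output above still carry markup, such as the flag image and the continent link. One possible cleanup step, sketched here under the assumption that dropping every tag and collapsing whitespace is acceptable for the task, is a second regular-expression pass. The function name strip_tags is hypothetical.

```python
import re

TAG = re.compile(r'<[^>]+>')   # any single tag, opening or closing
SPACES = re.compile(r'\s+')    # runs of whitespace, including newlines


def strip_tags(value):
    """Remove tags from a matched cell and normalise its whitespace."""
    without_tags = TAG.sub(' ', value)
    return SPACES.sub(' ', without_tags).strip()


# Values copied from the output above
cells = [
    '<img src="/places/static/images/flags/in.png" />',
    '3,287,590 square kilometres',
    '<a href="/places/default/continent/AS">AS</a>',
]
print([strip_tags(cell) for cell in cells])
```

A cell that contains only an image becomes an empty string, so a caller may want to filter those out or read the src attribute with a parser instead.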


Beautiful Soup

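Suppose we want to collect all the hyperlinks from a web page. For that we can use a parser called BeautifulSoup, documented in detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple terms, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is commonly paired with requests because it cannot fetch a web page by itself: it needs the document's markup as input to create a soup object. You can use the following Python script to gather the page title and hyperlinks.

Installing Beautiful Soup

Using the **pip** command, we can install **beautifulsoup** in either a virtual environment or a global installation. The session below installs the bs4 package, a thin wrapper that depends on beautifulsoup4, the package that contains the library itself.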

(base) D:\ProgramData>pip install bs4
Collecting bs4
   Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
   Running setup.py bdist_wheel for bs4 ... done
   Stored in directory: C:\Users\gaurav\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

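Example

Note that this example extends an earlier example implemented with the requests Python module. We use **r.text** to create a soup object, which is then used to fetch details such as the page title.

First, we need to import the necessary Python modules: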

import requests
from bs4 import BeautifulSoup

In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

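Now we need to create a Soup object as follows. The 'lxml' argument selects the lxml parser, which must be installed separately; the built-in 'html.parser' is an alternative that needs no extra package: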

soup = BeautifulSoup(r.text, 'lxml')
print(soup.title)
print(soup.title.text)

Output

The corresponding output will be as shown below:

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
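
The script above prints only the title. Collecting hyperlinks, the task this section opened with, can be sketched with find_all. To keep the sketch runnable offline, it parses an inline document invented for illustration and uses the built-in 'html.parser' backend rather than 'lxml'; with a live page you would pass the response text instead.

```python
from bs4 import BeautifulSoup

# Stand-in document invented for this illustration
page = """
<html><head><title>Sample Page</title></head>
<body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute
links = [a['href'] for a in soup.find_all('a', href=True)]

print(soup.title.text)
print(links)
```

Filtering on href=True avoids a KeyError on anchors that have no href. Relative links such as /contact would still need to be joined with the page URL, for example with urllib.parse.urljoin, before they can be requested.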

Lxml

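Another Python library for web scraping that we will discuss is lxml. It is a high-performance HTML and XML parsing library that is comparatively fast and straightforward to use. You can read more about it at https://lxml.de/.

Installing lxml

Using the pip command, we can install **lxml** in either a virtual environment or a global installation.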

(base) D:\ProgramData>pip install lxml
Collecting lxml
   Downloading https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl (3.6MB)
   100% |████████████████████████████████| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5

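Example: Data Extraction Using lxml and requests

In the following example, we use lxml and requests to scrape a particular element of a web page from **authoraditiagarwal.com**:

First, we need to import requests and, from the lxml library, html, as follows: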

import requests
from lxml import html

Now we need to provide the URL of the web page to scrape:

url = 'https://authoraditiagarwal.com/leadershipmanagement/'

Now we need to provide the path **(XPath)** to the particular element of that web page:

path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())

Output

The corresponding output will be as shown below:

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical 
axis represents the hours remaining to complete the committed work.
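
The XPath in the example is tied to an id generated for one specific page, and tree[0] raises an IndexError when nothing matches, for instance after the page layout changes. A more defensive variant might look like the sketch below. It runs against an inline document invented for illustration, and the helper name first_text is hypothetical.

```python
from lxml import html

# Stand-in markup invented for this illustration
document = html.fromstring(
    '<html><body><div id="panel"><p>First paragraph.</p>'
    '<p>Second <b>paragraph</b>.</p></div></body></html>'
)


def first_text(tree, path):
    """Return the text of the first node matching path, or None if no match."""
    matches = tree.xpath(path)
    return matches[0].text_content() if matches else None


print(first_text(document, '//*[@id="panel"]/p[2]'))
print(first_text(document, '//*[@id="missing"]/p[1]'))
```

text_content() concatenates the text of the element and all of its descendants, which is why the nested <b> tag does not interrupt the result.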
