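Extracting Attribute Values with Beautiful Soup in Python

Beautiful Soup is a Python library for parsing HTML and XML documents. It provides several methods for searching and navigating the parse tree, which makes it straightforward to pull data out of a document. To extract an attribute value, we parse the HTML document, locate the element that carries the attribute, and read the value from it. This article shows how to do that with Beautiful Soup in Python.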

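Algorithm

You can follow the steps below to extract an attribute value with Beautiful Soup in Python.

Parse the HTML document with the BeautifulSoup class from the bs4 library.
Use a suitable BeautifulSoup method, such as find() or find_all(), to locate the HTML element that contains the attribute you want to extract.
Check whether the attribute exists on the element, using a conditional statement or the has_attr() method.
If the attribute exists, extract its value using square brackets ([]) with the attribute name as the key.
If the attribute does not exist, handle the missing value appropriately.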
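
The steps above can be sketched roughly as follows. The markup and the variable names (html_doc, results) are made up for illustration, and recording None for a missing attribute is only one possible way to handle that case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html_doc = '<a href="/home">Home</a><a name="top">Top</a>'

# Parse the document
soup = BeautifulSoup(html_doc, 'html.parser')

results = []
# Locate the elements of interest
for a_tag in soup.find_all('a'):
    # Check whether the attribute is present
    if a_tag.has_attr('href'):
        # Read the value with square brackets
        results.append(a_tag['href'])
    else:
        # Handle the missing attribute (here we simply record None)
        results.append(None)

print(results)
```

Depending on the task, the else branch could just as well skip the element, log a warning, or raise an error.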

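Installing Beautiful Soup

Before using the Beautiful Soup library, you need to install it with pip, the Python package manager. To install Beautiful Soup, type the following command in a terminal or command prompt.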

pip install beautifulsoup4
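
As an optional check after installing, the short sketch below imports the package and prints its version. The variable name installed_version is our own; the version string you see depends on your environment.

```python
import bs4

# The package is installed as "beautifulsoup4" but imported as "bs4"
installed_version = bs4.__version__
print(installed_version)
```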

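Extracting Attribute Values

To extract an attribute value from an HTML tag, we first parse the HTML document with BeautifulSoup. We then use Beautiful Soup methods to locate a specific tag in the document and read its attribute value.

Example 1: Extracting the href attribute with find() and square brackets

In the example below, we first create an HTML document and pass it as a string to the BeautifulSoup constructor, using the html.parser parser. Next, we call the find() method of the soup object to look up the "a" tag. This returns the first "a" tag that occurs in the document. Finally, we use square bracket notation to read the href attribute from that tag, which gives us its value as a string.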

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the 'a' tag
a_tag = soup.find('a')

# Extract the value of the 'href' attribute
href_value = a_tag['href']

print(href_value)

Output

https://www.google.com
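
Square bracket access assumes both that the tag was found and that it has the attribute. The sketch below, using made-up markup, shows what happens otherwise: find() returns None when nothing matches, tag['href'] raises KeyError when the attribute is absent, and tag.get('href') returns None (or a default you supply) instead. The '#' default is an arbitrary placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an 'a' tag without an href attribute
soup = BeautifulSoup('<a>No Href</a>', 'html.parser')
a_tag = soup.find('a')

# Square brackets raise KeyError when the attribute is missing
try:
    a_tag['href']
    missing = False
except KeyError:
    missing = True

# get() returns None, or the given default, instead of raising
href_or_none = a_tag.get('href')
href_or_default = a_tag.get('href', '#')

# find() returns None when no matching tag exists
no_such_tag = soup.find('img')

print(missing, href_or_none, href_or_default, no_such_tag)
```

Using get() is usually the more convenient choice when a missing attribute is an expected case rather than an error.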

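Example 2: Using attrs to find elements that have a specific attribute

In the example below, we use the find_all() method to find every "a" tag that has an href attribute. The attrs argument specifies the attributes we are filtering on. {'href': True} matches elements that have an href attribute with any value, so the third "a" tag, which has no href, is skipped.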

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
   <a href="https://pythonlang.cn">Python</a>
   <a>No Href</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'a' tags with an 'href' attribute
a_tags_with_href = soup.find_all('a', attrs={'href': True})
for tag in a_tags_with_href:
   print(tag['href'])

Output

https://www.google.com
https://pythonlang.cn
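
When you want every attribute of a tag rather than a single one, the tag's attrs dictionary can be read directly. The sketch below uses made-up markup; note that with an HTML parser, Beautiful Soup treats class as a multi-valued attribute and returns it as a list, while most other attributes come back as plain strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several attributes on one tag
html_doc = '<a href="https://www.google.com" class="external link" id="g">Google</a>'
soup = BeautifulSoup(html_doc, 'html.parser')
a_tag = soup.find('a')

# attrs maps attribute names to their values
all_attrs = a_tag.attrs
for name, value in all_attrs.items():
    print(name, value)

# class is multi-valued, so its value is a list of class names
class_names = a_tag['class']
```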

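Example 3: Using find_all() to find all occurrences of an element

Sometimes you may want to find every occurrence of an HTML element on a page. The find_all() method does this. In the example below, we use find_all() to find all div tags with the class container. We then loop over each div tag and look up the h1 and p tags inside it.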

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'div' tags with class='container'
div_tags = soup.find_all('div', class_='container')
for div in div_tags:
   h1 = div.find('h1')
   p = div.find('p')
   print(h1.text, p.text)

Output

Heading 1 Paragraph 1
Heading 2 Paragraph 2
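
The same loop can also collect attribute values from each matched element. The sketch below is a variation on the example above with made-up markup: it reads an id attribute from each div with get() and guards against a missing h1, since find() returns None when a child tag is absent. The names rows and heading are our own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div has no id and no h1
html_doc = """
<div class="container" id="first"><h1>Heading 1</h1></div>
<div class="container"><p>Paragraph only</p></div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

rows = []
for div in soup.find_all('div', class_='container'):
    h1 = div.find('h1')
    # Avoid calling get_text() on None when the div has no h1
    heading = h1.get_text() if h1 is not None else None
    # get() returns None when the id attribute is missing
    rows.append((div.get('id'), heading))

print(rows)
```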

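Example 4: Using select() to find elements with a CSS selector

In the example below, we use the select() method to find all h1 tags inside div tags with the class container, using the CSS selector 'div.container h1'. The dot (.) denotes a class name, and the space denotes a descendant relationship.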

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'h1' tags inside a 'div' tag with class='container'
h1_tags = soup.select('div.container h1')
for h1 in h1_tags:
   print(h1.text)

Output

Heading 1
Heading 2
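
CSS selectors can also filter on attributes directly, which ties select() back to attribute extraction. The sketch below uses made-up markup and assumes a Beautiful Soup installation with its usual CSS selector support: 'a[href]' matches "a" tags that have an href attribute, 'a[href^="https"]' matches those whose href starts with "https", and select_one() returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html_doc = """
<a href="https://www.google.com">Google</a>
<a href="/about">About</a>
<a>No Href</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# All 'a' tags that have an href attribute
hrefs = [a['href'] for a in soup.select('a[href]')]

# Only 'a' tags whose href starts with "https"
https_hrefs = [a['href'] for a in soup.select('a[href^="https"]')]

# First match only; returns None if nothing matches
first = soup.select_one('a[href]')

print(hrefs)
print(https_hrefs)
print(first['href'])
```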

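Conclusion

In this article, we discussed how to extract attribute values from an HTML document with the Beautiful Soup library in Python. Using methods such as find(), find_all(), and select(), together with square bracket access on the resulting tags, we can pull the data we need out of HTML and XML documents.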

Rohan Singh

Updated on: July 10, 2023
