How to Parse Local HTML Files in Python?

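Parsing local HTML files with Python is a common task in web scraping, data analysis, and automation.

In this article, we will learn how to parse local HTML files in Python and explore several techniques for extracting data from them. We will cover modifying and removing elements in a file, printing data, traversing the file structure with a recursive child generator, finding a tag's children, and scraping a web page by extracting information from a given link. Through code examples and syntax, we will show how Python libraries such as BeautifulSoup and lxml handle these tasks efficiently.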

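Setting Up the Environment

Before parsing HTML files, let's make sure the necessary libraries are installed in our Python environment. We will rely mainly on two popular libraries: BeautifulSoup and lxml. To install them, use the following pip commands: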

pip install beautifulsoup4
pip install lxml

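Once installation is complete, we can start parsing local HTML files and extracting data. Several techniques are available, such as modifying the file, traversing the HTML structure, and web scraping. Let's look at some of them in detail, with syntax and complete examples.

Loading and Modifying HTML Files

To parse an HTML file, we first need to load it into our Python script. We can do this by opening the file with the built-in open function and reading its contents. Here is an example:

Syntax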

with open('example.html', 'r') as file:
    html_content = file.read()
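
As a variation on the snippet above, BeautifulSoup also accepts an open file object directly, so reading the file into a string first is optional. The following is a minimal sketch under a few assumptions: `load_soup` is a hypothetical helper name, the file is assumed to be saved as UTF-8, and Python's built-in `html.parser` is used so the sketch runs without extra dependencies (pass `'lxml'` instead if it is installed).

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path, parser="html.parser"):
    """Parse a local HTML file and return a BeautifulSoup object.

    `load_soup` is a hypothetical helper; UTF-8 is an assumption
    about how the file was saved.
    """
    with open(path, "r", encoding="utf-8") as file:
        # BeautifulSoup reads the markup from the file object itself.
        return BeautifulSoup(file, parser)


# Demo: write a tiny sample file to a temporary directory, then parse it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "example.html"
    sample.write_text(
        "<html><head><title>Document</title></head></html>",
        encoding="utf-8",
    )
    soup = load_soup(sample)
    print(soup.title.string)  # Document
```

Stating the encoding explicitly avoids depending on the platform's default, which can differ between operating systems.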

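After loading the HTML file, we can modify its contents using string manipulation or the higher-level methods provided by libraries such as BeautifulSoup. For example, to remove a specific element from the HTML file, we can use BeautifulSoup's extract method, which detaches the element from the tree and returns it.

Input HTML File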

#myhtml.html

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
</head>
<body>
  <div class="my-class">
      Hello World
  </div>
</body>
</html>

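Example

In this example, we load the HTML file ('myhtml.html'), create a BeautifulSoup object, find the element to remove by its tag and class attribute, and then remove it from the HTML structure. The prettify method prints the modified HTML so the change is easy to see.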

from bs4 import BeautifulSoup

# Load the HTML file
with open('myhtml.html', 'r') as file:
    html_content = file.read()

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Find the element to remove by its tag and remove it
element_to_remove = soup.find('div', {'class': 'my-class'})
element_to_remove.extract()

# Print the modified HTML
print(soup.prettify())

Output

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Document
  </title>
 </head>
 <body>
 </body>
</html>
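
The example above only removes an element. As a complementary sketch, the snippet below guards against a missing element and then modifies a remaining one in place. The inline HTML string, the `other` class, and the `kept` id are made up for illustration, and the built-in `html.parser` stands in for lxml so the sketch has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the second div is an assumption added for this sketch.
html_content = """
<html><body>
  <div class="my-class">Hello World</div>
  <div class="other">Keep me</div>
</body></html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# find() returns None when nothing matches, so check before removing.
target = soup.find("div", class_="my-class")
removed = None
if target is not None:
    removed = target.extract()  # detaches the tag and returns it
    print(removed.get_text(strip=True))  # Hello World

# Modify the remaining element in place: replace its text, add an attribute.
other = soup.find("div", class_="other")
other.string = "Updated text"
other["id"] = "kept"

# Serialize the modified tree back to a string.
modified_html = str(soup)
print(modified_html)
```

The resulting `modified_html` string could then be saved with the built-in open function in write mode.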

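Extracting Data from HTML Files

Printing or extracting specific data from an HTML file involves navigating its structure, and BeautifulSoup provides a range of methods for doing so. To extract data, we usually locate the desired element or elements by tag, class, or attribute.

For example, consider an HTML file that contains a list of articles, structured as shown below.

Example

In this example, we load the HTML file, create a BeautifulSoup object, find the ul element, and then extract all the li elements inside it. Finally, we print the text content of each li element, which represents an article title.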

HTML

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
</head>
<body>
  <div class="">
      <ul>
        <li>Article 1</li>
        <li>Article 2</li>
        <li>Article 3</li>
      </ul>
  </div>
</body>
</html>

Python

from bs4 import BeautifulSoup

# Load the HTML file
with open('myhtml.html', 'r') as file:
    html_content = file.read()

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Find all li elements within the ul tag
articles = soup.find('ul').find_all('li')

# Print the article titles
for article in articles:
    print(article.text)

Output

Article 1
Article 2
Article 3
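
The text above mentions locating elements by tag, class, or attribute. As a rough sketch of the latter two, the snippet below uses BeautifulSoup's `select` method with a CSS selector and `find_all` with an attribute filter. The `articles` class and the link targets are invented for illustration and are not part of the original sample file.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the class name and hrefs are assumptions.
html_content = """
<div class="articles">
  <ul>
    <li><a href="/a1.html">Article 1</a></li>
    <li><a href="/a2.html">Article 2</a></li>
    <li>Article 3</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")

# select() takes a CSS selector: li elements inside div.articles > ul.
titles = [li.get_text(strip=True) for li in soup.select("div.articles ul li")]

# href=True keeps only <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(titles)  # ['Article 1', 'Article 2', 'Article 3']
print(links)   # ['/a1.html', '/a2.html']
```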

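Traversing the HTML Structure with a Recursive Child Generator

A recursive child generator is a powerful technique for traversing the structure of an HTML file. BeautifulSoup lets us iterate over a tag's direct children with the .children attribute. By recursing into each child, we can walk the entire structure and extract the information we need.

Example

In this example, we load the HTML file, create a BeautifulSoup object, define a recursive function traverse_tags, and call it with the root element (here, the soup object). The function prints the tag name and its text content, then calls itself recursively for each child that is a tag.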

HTML

myhtml.html

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
</head>
<body>
  <div class="container">
    <h1>Welcome to Tutorialspoint</h1>
    <p>Arrays </p>
    <p>Linkedin List</p>
 </div>
</body>
</html>

Python

from bs4 import BeautifulSoup

# Load the HTML file
with open('myhtml.html', 'r') as file:
    html_content = file.read()
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Define a recursive function to traverse the structure
def traverse_tags(element):
    print(element.name)
    print(element.text)
    for child in element.children:
        if child.name:
            traverse_tags(child)

# Traverse the HTML structure
traverse_tags(soup)

Output

[document]


Document


Welcome to Tutorialspoint
Arrays 
Linkedin List


html



Document

Welcome to Tutorialspoint
Arrays 
Linkedin List

head

Document
meta
meta
meta
title
Document
body

Welcome to Tutorialspoint
Arrays 
Linkedin List

div
Welcome to Tutorialspoint
Arrays 
Linkedin List
h1
Welcome to Tutorialspoint
p
Arrays 
p
Linkedin List
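
The traverse_tags function above prints as it recurses. One possible variant, sketched below, turns the traversal into an actual Python generator with `yield from`, so the caller decides what to do with each tag. The name `iter_tags` and the `depth` value are assumptions introduced for this sketch, and the inline HTML is a shortened version of the sample file.

```python
from bs4 import BeautifulSoup, Tag


def iter_tags(element, depth=0):
    """Recursively yield (depth, tag) for every Tag below `element`.

    `iter_tags` is a hypothetical helper, not part of BeautifulSoup.
    """
    for child in element.children:
        # Skip text nodes, comments, and the doctype; keep real tags only.
        if isinstance(child, Tag):
            yield depth, child
            # Delegate to the same generator for the child's own children.
            yield from iter_tags(child, depth + 1)


html_content = """
<div class="container">
  <h1>Welcome to Tutorialspoint</h1>
  <p>Arrays</p>
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Print each tag name indented by its nesting depth.
for depth, tag in iter_tags(soup):
    print("  " * depth + tag.name)
```

Because the generator is lazy, a caller can stop early, for example as soon as the first matching tag is found, without walking the rest of the tree.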

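Web Scraping from a Link

Besides parsing local HTML files, we can also extract useful information by scraping web pages. With Python libraries such as BeautifulSoup and requests, we can fetch a page's HTML content and extract the relevant data.

Syntax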

import requests
from bs4 import BeautifulSoup

# Define the URL
url = 'https://tutorialspoint.com/index.htm'
# Send a GET request
response = requests.get(url)
# Create a BeautifulSoup object with the webpage content
soup = BeautifulSoup(response.content, 'lxml')

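Example

In this example, we use the requests library to send a GET request to the target web page. If the request succeeds, we create a BeautifulSoup object from the response content, then find the page title and the first paragraph by their tags. Finally, we print the extracted information.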

import requests
from bs4 import BeautifulSoup

# Define the URL of the webpage to scrape
url = 'https://tutorialspoint.com/index.htm'

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    print("Fetch was successful.")
    
    # Create a BeautifulSoup object with the webpage content
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Find and print the title of the webpage
    mytitle = soup.find('title').text
    print(f"HTMl Webpage Title: {mytitle}")
    
    # Find and print the first paragraph of the content
    myparagraph = soup.find('p').text
    print(f"First Paragraph listed in the website: {myparagraph}")
    
else:
    print(f"Error code: {response.status_code}")

Output

Fetch was successful.
HTMl Webpage Title: Online Courses and eBooks Library | Tutorialspoint
First Paragraph listed in the website: Premium Courses
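
In the example above, soup.find returns None when a page has no matching tag, and accessing .text on None raises an AttributeError. The sketch below shows one way to guard against that. `summarize_page` is a hypothetical helper that takes HTML as a string (for instance, the text of a fetched response), so it can be tried without a network connection. The built-in `html.parser` is used here as an assumption to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup


def summarize_page(html):
    """Return the page title and first paragraph, or None where absent.

    `summarize_page` is a hypothetical helper for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("title")
    paragraph_tag = soup.find("p")

    return {
        "title": title_tag.get_text(strip=True) if title_tag is not None else None,
        "first_paragraph": (
            paragraph_tag.get_text(strip=True) if paragraph_tag is not None else None
        ),
    }


# A page with a title but no paragraph does not raise an error.
print(summarize_page("<html><head><title>Demo</title></head><body></body></html>"))
# {'title': 'Demo', 'first_paragraph': None}
```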

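Conclusion

Parsing local HTML files with Python opens up a wide range of possibilities for data extraction and manipulation. By modifying files, removing elements, printing data, using recursive child generators, and scraping web pages from links, we can extract relevant information from HTML effectively. Libraries such as BeautifulSoup and lxml make it straightforward to navigate and manipulate HTML structures. With the techniques and code examples in this article, you can extract and use data from HTML files in your own Python projects.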

Tarun Singh

Updated on: August 31, 2023
