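Python Pandas - Working with HTML Data

The Pandas library provides extensive functionality for handling data in many formats. One of them is HTML (HyperText Markup Language), the standard format for structuring web page content. HTML files often contain tabular data, which Pandas can extract and analyze.

An HTML table is a structured way to present data in rows and columns on a web page. The **pandas.read_html()** function extracts such tables from HTML, and the **DataFrame.to_html()** method writes a Pandas DataFrame back out as an HTML table.

In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.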

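Reading HTML Tables from a URL

The **pandas.read_html()** function reads tables from an HTML file, a file-like object, or a URL. It parses the <table> elements in the HTML and returns a list of **pandas.DataFrame** objects, one per table found. The function needs an HTML parser to be installed, such as lxml, or BeautifulSoup4 together with html5lib.

Example

The following is a basic example of reading data from a URL with the **pandas.read_html()** function.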

import pandas as pd

# Read tables from a SQL tutorial
url = "https://tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the resultant DataFrame
print('Output First DataFrame:', df.head())

The output of the above code is as follows:

Output First DataFrame:



	ID	NAME	AGE	ADDRESS	SALARY
0	1	Ramesh	32	Ahmedabad	2000.0
1	2	Khilan	25	Delhi	1500.0
2	3	Kaushik	23	Kota	2000.0
3	4	Chaitali	25	Mumbai	6500.0
4	5	Hardik	27	Bhopal	8500.0

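Reading HTML Data from a String

You can read HTML directly from a string by wrapping it in Python's **io.StringIO** class. Since pandas 2.1, passing a literal HTML string to **pandas.read_html()** is deprecated, so wrapping the string this way is the recommended approach.

Example

The following example demonstrates how to read an HTML string with StringIO, without saving it to a file.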

import pandas as pd
from io import StringIO

# Create a HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

The output of the above code is as follows:




	C1	C2	C3
0	a	b	c
1	x	y	z
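
Not every HTML table marks its header row with <th> cells. The following is a minimal sketch, assuming a made-up table in which every cell is a <td>; the sample data and the variable names (html_no_th, raw, named) are illustrative and not part of the original tutorial. It uses the **header** parameter of **pandas.read_html()** to promote the first row to column names.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: every cell is a <td>, so there is no <th> header row
html_no_th = """
<table>
   <tr><td>city</td><td>population</td></tr>
   <tr><td>Delhi</td><td>100</td></tr>
   <tr><td>Kota</td><td>200</td></tr>
</table>
"""

# Without header=, the first row is treated as ordinary data
raw = pd.read_html(StringIO(html_no_th))[0]

# header=0 promotes the first row to column names
named = pd.read_html(StringIO(html_no_th), header=0)[0]

print(raw.shape)
print(list(named.columns))
```

With the header row promoted, the remaining values are parsed as data, so a numeric column such as population can be used in calculations right away.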

Example

Here is another way to read an HTML string, this time without the **io.StringIO** class. We save the HTML string to a temporary file and then read that file with the **pd.read_html()** function.

import pandas as pd

# Create a HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

The output of the above code is as follows:




	C1	C2	C3
0	a	b	c
1	x	y	z
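
The example above leaves temp.html behind in the working directory. As a variation, the sketch below writes the same HTML into a directory created with Python's tempfile module, so the file is deleted automatically once it has been read. The file name table.html and the variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# The directory and everything in it are removed when the with-block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "table.html"  # hypothetical file name
    html_path.write_text(html_str, encoding="utf-8")

    # Read the first (and only) table while the file still exists
    df = pd.read_html(str(html_path))[0]

print(df)
```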

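Handling Multiple Tables from an HTML File

When an HTML file contains multiple tables, the **match** parameter of the **pd.read_html()** function selects only the tables that contain text matching the given string or regular expression.

Example

The following example uses the **match** parameter to read a table containing specific text from an HTML page that has multiple tables.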

import pandas as pd

# Read tables from a SQL tutorial
url = "https://tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

The output of the above code is as follows:




	Field	Type	Null	Key	Default	Extra
1	ID	int(11)	NO	PRI	NaN	NaN
2	NAME	varchar(20)	NO	NaN	NaN	NaN
3	AGE	int(11)	NO	NaN	NaN	NaN
4	ADDRESS	char(25)	YES	NaN	NaN	NaN
5	SALARY	decimal(18,2)	YES	NaN	NaN	NaN
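
The same selection can be tried without network access. The sketch below is an illustration only: the two small tables and their id attributes are made up. It compares reading every table, filtering by text with **match**, and filtering by HTML attributes with the **attrs** parameter of **pd.read_html()**.

```python
from io import StringIO

import pandas as pd

# Hypothetical document with two tables, distinguished by their id attributes
html_doc = """
<table id="people">
   <tr><th>Name</th><th>Age</th></tr>
   <tr><td>Ramesh</td><td>32</td></tr>
</table>
<table id="schema">
   <tr><th>Field</th><th>Type</th></tr>
   <tr><td>ID</td><td>int(11)</td></tr>
</table>
"""

# No filter: one DataFrame per <table> element
all_tables = pd.read_html(StringIO(html_doc))

# match keeps only the tables that contain the given text
by_text = pd.read_html(StringIO(html_doc), match="Field")

# attrs keeps only the tables whose HTML attributes match
by_attr = pd.read_html(StringIO(html_doc), attrs={"id": "people"})

print(len(all_tables), len(by_text), len(by_attr))
print(by_text[0])
print(by_attr[0])
```

Filtering by attributes is useful when several tables share similar text but the page gives each one a distinct id or class.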

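Writing a DataFrame to HTML

A Pandas DataFrame can be converted to an HTML table with the **DataFrame.to_html()** method. If the **buf** parameter is None (the default), the method returns the HTML as a string.

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML table with the **DataFrame.to_html()** method.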

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to HTML table
html = df.to_html()

# Display the HTML string
print(html)

The output of the above code is as follows:

<table border="1" class="dataframe">
   <thead>
      <tr style="text-align: right;">
         <th></th>
         <th>A</th>
         <th>B</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <th>0</th>
         <td>1</td>
         <td>2</td>
      </tr>
      <tr>
         <th>1</th>
         <td>3</td>
         <td>4</td>
      </tr>
   </tbody>
</table>
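
The output can be adjusted, and it can be written straight to a file. The following is a minimal sketch, assuming a CSS class named table and an output file named table.html (both hypothetical). It uses the **index**, **classes** and **buf** parameters of **DataFrame.to_html()**.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# index=False omits the row labels; classes adds CSS class names to <table>
html = df.to_html(index=False, classes="table")
print(html)

# Passing a path as buf writes the HTML to that file and returns None
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "table.html"  # hypothetical file name
    result = df.to_html(buf=str(out_path), index=False)
    written = out_path.read_text(encoding="utf-8")

print(result is None)
```

Leaving out the index is often preferable for web pages, where the default 0, 1, 2 row labels carry no meaning for the reader.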
