How to Parse an HTML Page with Python to Extract HTML Tables

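Problem

You need to extract HTML tables from a web page.

Introduction

The Internet and the World Wide Web (WWW) are among today's most important sources of information. There is so much of it that choosing content from the many options is difficult. Most of this information can be retrieved over HTTP.

We can also perform these operations programmatically, so that information is retrieved and processed automatically.

Python lets us do this with the HTTP client in its standard library, but the requests module makes fetching web pages much easier.

In this article, we will learn how to parse an HTML page and extract the HTML tables embedded in it.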
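
For comparison with requests, here is a minimal sketch of fetching a page with only the standard library's urllib.request. The function name fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original recipe.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0):
    """Download a page with the standard library and return its text."""
    # Hypothetical User-Agent value; some servers may reject the urllib default.
    request = Request(url, headers={"User-Agent": "table-scraper-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling fetch_html with the page URL returns the page source as a string. requests handles the headers, decoding, and error reporting in a single call, which is why the steps that follow use it instead.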

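How to do it

1. We will use the requests, pandas, beautifulsoup4, and tabulate packages. Install any that are missing from your system; if you are not sure, run pip freeze to check what is installed. Note that pandas.read_html also needs an HTML parser backend such as lxml (or html5lib together with beautifulsoup4).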

import requests
import pandas as pd
from tabulate import tabulate
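
As an alternative to reading the pip freeze output by eye, the same check can be sketched in Python with importlib.metadata (available from Python 3.8). The helper name missing_packages and the REQUIRED list are assumptions made for this illustration.

```python
from importlib import metadata

# Distribution names as they appear on PyPI (assumed from step 1).
REQUIRED = ["requests", "pandas", "beautifulsoup4", "tabulate"]


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print(f"*** Missing packages: {missing_packages(REQUIRED) or 'none'}")
```

Anything reported as missing can then be installed with pip install followed by the package name.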

2. We will parse the page https://tutorialspoint.com/python/python_basic_operators.htm and print all the HTML tables embedded in it.

# set the site url
site_url = "https://tutorialspoint.com/python/python_basic_operators.htm"

3. We will make a request to the server and check the response.

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")
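
The printed status code is just a number. A small sketch of turning it into something readable with the standard library's http.HTTPStatus is shown here; the helper name describe_status and the outcome labels are assumptions for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code registered in the standard library.
        phrase = "Unknown"
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In a script that should stop on failure, calling response.raise_for_status() on the requests response raises an exception for 4xx and 5xx codes instead of continuing with an error page.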

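4. A response code of 200 means the server responded successfully. Next, we inspect the request headers, the response headers, and the first 100 characters of the text returned by the server.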

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

Output

*** Printing the request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>

5. We will now parse the HTML with BeautifulSoup.

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
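
To see the difference between the title tag and its string without contacting the server, the same calls can be tried on a small inline document. The sample_html markup below is a made-up stand-in for response.text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so the example runs offline.
sample_html = """
<html>
  <head><title>Sample - Operators</title></head>
  <body>
    <h2>Arithmetic</h2>
    <h2>Logical</h2>
  </body>
</html>
"""

page = BeautifulSoup(sample_html, "html.parser")

title_tag = page.title            # the whole <title> element (a Tag)
title_text = page.title.string    # only the text inside it
headings = [h2.get_text(strip=True) for h2 in page.find_all("h2")]

print(title_tag)   # <title>Sample - Operators</title>
print(title_text)  # Sample - Operators
print(headings)    # ['Arithmetic', 'Logical']
```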

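6. Most table captions are defined in h2, h3, h4, h5, or h6 tags. We first locate such a heading and then pick up the HTML tables that follow it. For this logic we use find and find_next_siblings as shown below. find returns the first matching heading, and find_next_siblings then walks every element that follows it at the same level, so every sibling table after that first heading is picked up.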

from io import StringIO

# Find all the h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")
# Find the first heading, then read every table among its following siblings
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Wrap the markup in StringIO; recent pandas versions deprecate literal HTML strings
        df = pd.read_html(StringIO(str(my_table)))
        print(tabulate(df[0], headers='keys', tablefmt='psql'))
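
The loop above starts from the first heading only. A possible variation, sketched here on made-up markup, pairs every heading with the first table that follows it before the next heading. The function name tables_by_heading, the HEADING_TAGS list, and SAMPLE_HTML are assumptions for illustration, and the rows are returned as plain lists rather than DataFrames.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, so the example runs offline.
SAMPLE_HTML = """
<div>
  <h2>Arithmetic</h2>
  <p>Intro text.</p>
  <table>
    <tr><th>Operator</th><th>Description</th></tr>
    <tr><td>+</td><td>Addition</td></tr>
    <tr><td>-</td><td>Subtraction</td></tr>
  </table>
  <h2>Logical</h2>
  <table>
    <tr><th>Operator</th><th>Description</th></tr>
    <tr><td>and</td><td>Logical AND</td></tr>
  </table>
</div>
"""

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]


def tables_by_heading(html):
    """Map each heading's text to the rows of the table that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in soup.find_all(HEADING_TAGS):
        # Walk forward through siblings until a table or the next heading.
        for sibling in heading.find_next_siblings():
            if sibling.name in HEADING_TAGS:
                break
            if sibling.name == "table":
                rows = [
                    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                    for row in sibling.find_all("tr")
                ]
                result[heading.get_text(strip=True)] = rows
                break
    return result


for caption, rows in tables_by_heading(SAMPLE_HTML).items():
    print(caption, rows)
```

Keeping the heading text next to each table makes it easier to tell the tables apart later, at the cost of skipping any table that has no heading among its preceding siblings.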

Complete code

7. Now let's put everything together.

# STEP1 : Download the page required
from io import StringIO

import requests
import pandas as pd


# set the site url
site_url = "https://tutorialspoint.com/python/python_basic_operators.htm"

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

# Find all the h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")
# Find the first heading, then read every table among its following siblings
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Wrap the markup in StringIO; recent pandas versions deprecate literal HTML strings
        df = pd.read_html(StringIO(str(my_table)))
        print(df)

Output

*** The response for https://tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing the request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[ Operator Description \
0 + Addition Adds values on either side of the operator.
1 - Subtraction Subtracts right hand operand from left hand op...
2 * Multiplication Multiplies values on either side of the operator
3 / Division Divides left hand operand by right hand operand
4 % Modulus Divides left hand operand by right hand operan...
5 ** Exponent Performs exponential (power) calculation on op...
6 // Floor Division - The division of operands wher...

Example
0 a + b = 30
1 a – b = -10
2 a * b = 200
3 b / a = 2
4 b % a = 0
5 a**b =10 to the power 20
6 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.... ]
[ Operator Description \
0 == If the values of two operands are equal, then ...
1 != If values of two operands are not equal, then ...
2 <> If values of two operands are not equal, then ...
3 > If the value of left operand is greater than t...
4 < If the value of left operand is less than the ...
5 >= If the value of left operand is greater than o...
6 <= If the value of left operand is less than or e...

Example
0 (a == b) is not true.
1 (a != b) is true.
2 (a <> b) is true. This is similar to != operator.
3 (a > b) is not true.
4 (a < b) is true.
5 (a >= b) is not true.
6 (a <= b) is true. ]
[ Operator Description \
0 = Assigns values from right side operands to lef...
1 += Add AND It adds right operand to the left operand and ...
2 -= Subtract AND It subtracts right operand from the left opera...
3 *= Multiply AND It multiplies right operand with the left oper...
4 /= Divide AND It divides left operand with the right operand...
5 %= Modulus AND It takes modulus using two operands and assign...
6 **= Exponent AND Performs exponential (power) calculation on op...
7 //= Floor Division It performs floor division on operators and as...

Example
0 c = a + b assigns value of a + b into c
1 c += a is equivalent to c = c + a
2 c -= a is equivalent to c = c - a
3 c *= a is equivalent to c = c * a
4 c /= a is equivalent to c = c / a
5 c %= a is equivalent to c = c % a
6 c **= a is equivalent to c = c ** a
7 c //= a is equivalent to c = c // a ]
[ Operator \
0 & Binary AND
1 | Binary OR
2 ^ Binary XOR
3 ~ Binary Ones Complement
4 << Binary Left Shift
5 >> Binary Right Shift

Description \
0 Operator copies a bit to the result if it exis...
1 It copies a bit if it exists in either operand.
2 It copies the bit if it is set in one operand ...
3 It is unary and has the effect of 'flipping' b...
4 The left operands value is moved left by the n...
5 The left operands value is moved right by the ...

Example
0 (a & b) (means 0000 1100)
1 (a | b) = 61 (means 0011 1101)
2 (a ^ b) = 49 (means 0011 0001)
3 (~a ) = -61 (means 1100 0011 in 2's complement...
4 a << 2 = 240 (means 1111 0000)
5 a >> 2 = 15 (means 0000 1111) ]
[ Operator Description \
0 and Logical AND If both the operands are true then condition b...
1 or Logical OR If any of the two operands are non-zero then c...
2 not Logical NOT Used to reverse the logical state of its operand.

Example
0 (a and b) is true.
1 (a or b) is true.
2 Not(a and b) is false. ]
[ Operator Description \
0 in Evaluates to true if it finds a variable in th...
1 not in Evaluates to true if it does not finds a varia...

Example
0 x in y, here in results in a 1 if x is a membe...
1 x not in y, here not in results in a 1 if x is... ]
[ Operator Description \
0 is Evaluates to true if the variables on either s...
1 is not Evaluates to false if the variables on either ...

Example
0 x is y, here is results in 1 if id(x) equals i...
1 x is not y, here is not results in 1 if id(x) ... ]
[ Sr.No. Operator & Description
0 1 ** Exponentiation (raise to the power)
1 2 ~ + - Complement, unary plus and minus (method...
2 3 * / % // Multiply, divide, modulo and floor di...
3 4 + - Addition and subtraction
4 5 >> << Right and left bitwise shift
5 6 & Bitwise 'AND'
6 7 ^ | Bitwise exclusive `OR' and regular `OR'
7 8 <= < > >= Comparison operators
8 9 <> == != Equality operators
9 10 = %= /= //= -= += *= **= Assignment operators
10 11 is is not Identity operators
11 12 in not in]
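
If third-party packages cannot be installed, table cells can also be pulled out with only the standard library. The sketch below is an alternative to the BeautifulSoup and pandas approach above, not the method used in this article. The class name TableExtractor and the sample markup are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup standing in for a downloaded page.
sample = """
<table>
  <tr><th>Operator</th><th>Example</th></tr>
  <tr><td>+</td><td>a + b = 30</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(sample)
print(extractor.tables)
```

This gives up the DataFrame conveniences of pandas, so it is mainly useful for small scripts where plain lists of strings are enough.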

Kiran P

Updated on: 9 November 2020
