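How to parse an HTML page with Python to extract HTML tables?
Problem
You need to extract HTML tables from a web page.
Introduction
The Internet and the World Wide Web (WWW) are among the most important sources of information today. There is so much information available that it is hard to choose content from the many options. Most of that information can be retrieved over HTTP.
We can also perform these operations programmatically, so that information is retrieved and processed automatically.
Python lets us do this with its standard library and HTTP clients, but the requests module makes fetching web page content easier.
In this article, we will learn how to parse an HTML page and extract the HTML tables embedded in it.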
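For comparison with requests, the standard-library route mentioned above can be sketched as follows. This is a minimal sketch, not the method used in the rest of the article; the helper name fetch_text and its parameters are assumptions introduced here for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Fetch `url` with the standard library and return the body as text.

    Hypothetical helper for illustration; not part of the original article.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# Example call (needs network access):
# html = fetch_text("https://tutorialspoint.com/python/python_basic_operators.htm")
```

Compared with this, requests handles decoding and exposes the status code and headers directly on the response object, which is why the steps below use it.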
How to do it..
1. We will use the requests, pandas, beautifulsoup4, and tabulate packages. If any of them is missing on your system, install it. If you are not sure, verify with pip freeze.
import requests
import pandas as pd
from tabulate import tabulate
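As an alternative to reading through the pip freeze output by hand, the check in step 1 can be sketched in Python with the standard importlib.metadata module. The function name missing_packages and the REQUIRED list are assumptions made for this sketch; the package names are the four listed in step 1.

```python
from importlib import metadata

# Distribution names as listed in step 1 of this article.
REQUIRED = ["requests", "pandas", "beautifulsoup4", "tabulate"]


def missing_packages(names):
    """Return the entries of `names` that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    if absent:
        print("Install with: pip install " + " ".join(absent))
    else:
        print("All required packages are installed.")
```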
2. We will use https://tutorialspoint.com/python/python_basic_operators.htm, parsing the page and printing all the HTML tables embedded in it.
# set the site url
site_url = "https://tutorialspoint.com/python/python_basic_operators.htm"
3. We will make a request to the server and look at the response.
# Make a request to the server
response = requests.get(site_url)
# Check the response
print(f"*** The response for {site_url} is {response.status_code}")
4. A response code of 200 means the server answered the request successfully. Next, we inspect the request headers, the response headers, and the first 100 characters of the text returned by the server.
# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")
# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")
# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")
Output
*** Printing the request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
5. We will now parse the HTML with BeautifulSoup.
# Parse the HTML pages
from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")
# You can extract the page title as string as well
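print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
6. Most table headings are defined in h2, h3, h4, h5, or h6 tags. We first identify these tags, then pick up the HTML tables that follow the identified tag. For this logic we will use find, sibling, and find_next_siblings as shown below.
from io import StringIO
# Find all the h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")
# Locate the first heading tag (h2 to h6) on the page
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
# Walk the siblings that follow the heading and print every table found
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
        df = pd.read_html(StringIO(str(my_table)))
        print(tabulate(df[0], headers='keys', tablefmt='psql'))
Complete code
7. Now let us put everything together.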
# STEP1 : Download the page required
from io import StringIO
import requests
import pandas as pd
# set the site url
site_url = "https://tutorialspoint.com/python/python_basic_operators.htm"
# Make a request to the server
response = requests.get(site_url)
# Check the response
print(f"*** The response for {site_url} is {response.status_code}")
# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")
# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")
# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")
# Parse the HTML pages
from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")
# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
# Find all the h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")
# Locate the first heading tag (h2 to h6) on the page
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
# Walk the siblings that follow the heading and print every table found
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
        df = pd.read_html(StringIO(str(my_table)))
        print(df)
Output
*** The response for https://tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing the request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[  Operator          Description  \
0  + Addition        Adds values on either side of the operator.
1  - Subtraction     Subtracts right hand operand from left hand op...
2  * Multiplication  Multiplies values on either side of the operator
3  / Division        Divides left hand operand by right hand operand
4  % Modulus         Divides left hand operand by right hand operan...
5  ** Exponent       Performs exponential (power) calculation on op...
6  //                Floor Division - The division of operands wher...

   Example
0  a + b = 30
1  a – b = -10
2  a * b = 200
3  b / a = 2
4  b % a = 0
5  a**b =10 to the power 20
6  9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11....]
[  Operator  Description  \
0  ==        If the values of two operands are equal, then ...
1  !=        If values of two operands are not equal, then ...
2  <>        If values of two operands are not equal, then ...
3  >         If the value of left operand is greater than t...
4  <         If the value of left operand is less than the ...
5  >=        If the value of left operand is greater than o...
6  <=        If the value of left operand is less than or e...

   Example
0  (a == b) is not true.
1  (a != b) is true.
2  (a <> b) is true. This is similar to != operator.
3  (a > b) is not true.
4  (a < b) is true.
5  (a >= b) is not true.
6  (a <= b) is true.]
[  Operator            Description  \
0  =                   Assigns values from right side operands to lef...
1  += Add AND          It adds right operand to the left operand and ...
2  -= Subtract AND     It subtracts right operand from the left opera...
3  *= Multiply AND     It multiplies right operand with the left oper...
4  /= Divide AND       It divides left operand with the right operand...
5  %= Modulus AND      It takes modulus using two operands and assign...
6  **= Exponent AND    Performs exponential (power) calculation on op...
7  //= Floor Division  It performs floor division on operators and as...

   Example
0  c = a + b assigns value of a + b into c
1  c += a is equivalent to c = c + a
2  c -= a is equivalent to c = c - a
3  c *= a is equivalent to c = c * a
4  c /= a is equivalent to c = c / a
5  c %= a is equivalent to c = c % a
6  c **= a is equivalent to c = c ** a
7  c //= a is equivalent to c = c // a]
[  Operator  \
0  & Binary AND
1  | Binary OR
2  ^ Binary XOR
3  ~ Binary Ones Complement
4  << Binary Left Shift
5  >> Binary Right Shift

   Description  \
0  Operator copies a bit to the result if it exis...
1  It copies a bit if it exists in either operand.
2  It copies the bit if it is set in one operand ...
3  It is unary and has the effect of 'flipping' b...
4  The left operands value is moved left by the n...
5  The left operands value is moved right by the ...

   Example
0  (a & b) (means 0000 1100)
1  (a | b) = 61 (means 0011 1101)
2  (a ^ b) = 49 (means 0011 0001)
3  (~a ) = -61 (means 1100 0011 in 2's complement...
4  a << 2 = 240 (means 1111 0000)
5  a >> 2 = 15 (means 0000 1111)]
[  Operator         Description  \
0  and Logical AND  If both the operands are true then condition b...
1  or Logical OR    If any of the two operands are non-zero then c...
2  not Logical NOT  Used to reverse the logical state of its operand.

   Example
0  (a and b) is true.
1  (a or b) is true.
2  Not(a and b) is false.]
[  Operator  Description  \
0  in        Evaluates to true if it finds a variable in th...
1  not in    Evaluates to true if it does not finds a varia...

   Example
0  x in y, here in results in a 1 if x is a membe...
1  x not in y, here not in results in a 1 if x is...]
[  Operator  Description  \
0  is        Evaluates to true if the variables on either s...
1  is not    Evaluates to false if the variables on either ...

   Example
0  x is y, here is results in 1 if id(x) equals i...
1  x is not y, here is not results in 1 if id(x) ...]
[   Sr.No.  Operator & Description
0   1       ** Exponentiation (raise to the power)
1   2       ~ + - Complement, unary plus and minus (method...
2   3       * / % // Multiply, divide, modulo and floor di...
3   4       + - Addition and subtraction
4   5       >> << Right and left bitwise shift
5   6       & Bitwise 'AND'
6   7       ^ | Bitwise exclusive `OR' and regular `OR'
7   8       <= < > >= Comparison operators
8   9       <> == != Equality operators
9   10      = %= /= //= -= += *= **= Assignment operators
10  11      is is not Identity operators
11  12      in not in]
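The loop in step 6 starts from the first heading only and prints every table that follows it, so the link between a table and its own heading is lost. A possible variation, sketched below under stated assumptions, pairs each heading with the first table that appears before the next heading. The function name tables_by_heading, the HEADINGS list, and the sample HTML are all made up for this illustration, and rows are returned as plain lists rather than pandas DataFrames to keep the sketch small.

```python
from bs4 import BeautifulSoup

# Heading tags that may introduce a table, as described in step 6.
HEADINGS = ["h2", "h3", "h4", "h5", "h6"]


def tables_by_heading(html):
    """Map heading text to the rows of the first table that follows it.

    Hypothetical helper: a heading with no table before the next heading is skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in soup.find_all(HEADINGS):
        for sibling in heading.find_next_siblings():
            if sibling.name in HEADINGS:
                break  # reached the next section without finding a table
            if sibling.name == "table":
                rows = [
                    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                    for tr in sibling.find_all("tr")
                ]
                result[heading.get_text(strip=True)] = rows
                break
    return result


# Made-up sample markup, used only to exercise the function offline.
SAMPLE_HTML = """
<h2>Arithmetic</h2><p>intro</p>
<table><tr><th>Operator</th><th>Example</th></tr>
<tr><td>+</td><td>a + b = 30</td></tr></table>
<h2>No table here</h2><p>text</p>
<h3>Comparison</h3>
<table><tr><td>==</td><td>(a == b) is not true.</td></tr></table>
"""

if __name__ == "__main__":
    for title, rows in tables_by_heading(SAMPLE_HTML).items():
        print(title, rows)
```

On a real page, the same function could be called with response.text from step 3; whether the headings and tables are actually siblings depends on the page's markup.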