Beautiful Soup - Parsing a Web Page



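Now let's test the Beautiful Soup package on an HTML page and extract some information from it. We use https://tutorialspoint.com/index.htm as the example; you can choose any other web page.

The following code extracts the title from the page:

Example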

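from bs4 import BeautifulSoup
import requests

# Download the page
url = "https://tutorialspoint.com/index.htm"
req = requests.get(url)

# Parse the raw response bytes with Python's built-in HTML parser
soup = BeautifulSoup(req.content, "html.parser")

# Print the <title> tag of the document
print(soup.title)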

Output

<title>Online Courses and eBooks Library</title>
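
The output above is the whole tag. When only the text is needed, the tag's .string attribute returns it. The following is a minimal sketch, assuming an inline HTML snippet (a hypothetical stand-in for the downloaded page) so that it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page content.
html = (
    "<html><head><title>Online Courses and eBooks Library</title></head>"
    "<body></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# soup.title is the whole <title> element, or None if the page has no title.
title_tag = soup.title
title_text = title_tag.string if title_tag is not None else None

print(title_tag)
print(title_text)
```

Checking for None first avoids an AttributeError on pages that have no title element.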

A common task is to extract all the URLs from a web page. To do that, we only need to add the following lines of code:

for link in soup.find_all('a'):
    print(link.get('href'))

Output

The following is part of the output of the loop above:

https://tutorialspoint.com/index.htm
https://tutorialspoint.com/codingground.htm
https://tutorialspoint.com/about/about_careers.htm
https://tutorialspoint.com/whiteboard.htm
https://tutorialspoint.com/online_dev_tools.htm
https://tutorialspoint.com/business/index.asp
https://tutorialspoint.com/market/teach_with_us.jsp
https://#/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://tutorialspoint.com/categories/development
https://tutorialspoint.com/categories/it_and_software
https://tutorialspoint.com/categories/data_science_and_ai_ml
https://tutorialspoint.com/categories/cyber_security
https://tutorialspoint.com/categories/marketing
https://tutorialspoint.com/categories/office_productivity
https://tutorialspoint.com/categories/business
https://tutorialspoint.com/categories/lifestyle
https://tutorialspoint.com/latest/prime-packs
https://tutorialspoint.com/market/index.asp
https://tutorialspoint.com/latest/ebooks
…
…
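
Real pages often contain relative links and anchor tags without an href attribute, for which link.get('href') returns None. The sketch below shows one possible way to handle both cases. The inline HTML and the base_url value are assumptions made for illustration, and urljoin comes from Python's standard library rather than from Beautiful Soup:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base address, used to resolve relative links.
base_url = "https://tutorialspoint.com/index.htm"

# Hypothetical markup standing in for a downloaded page.
html = """
<a href="/about/about_careers.htm">Careers</a>
<a href="https://twitter.com/tutorialspoint">Twitter</a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only the <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    # Relative links are resolved against base_url; absolute ones pass through.
    links.append(urljoin(base_url, link["href"]))

for url in links:
    print(url)
```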

To parse a web page stored locally in the current working directory, open the HTML file and pass the resulting file object to the BeautifulSoup() constructor.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup)

Output

<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
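
The example above relies on the platform's default text encoding when it opens the file. As a hedged variation, the sketch below passes the encoding explicitly. It first writes a small file into a temporary directory (an assumption made so the example is self-contained) and then parses it:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content, written to disk only to make the example runnable.
html = (
    "<html><head><title>Hello World</title></head>"
    "<body><h1>Hello World</h1></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)

    # Open the file with an explicit encoding and hand the file object to the constructor.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file is closed.
heading = soup.h1.get_text()
print(heading)
```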

You can also pass a string containing HTML markup to the constructor, as follows:

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Hello World</title>
   </head>
   <body>
      <h1 style="text-align:center;">Hello World</h1>
   </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup)

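Beautiful Soup parses the document with the best parser available. It uses an HTML parser unless otherwise specified. The examples above name the parser explicitly by passing 'html.parser' as the second argument.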
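
The choice of parser can also be made in code. The sketch below is one possible pattern, not a recommendation from the original text: it tries the third-party lxml parser first (an assumption, since lxml may not be installed) and falls back to the built-in html.parser when Beautiful Soup raises FeatureNotFound:

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<html><head><title>Hello World</title></head><body></body></html>"

try:
    # lxml is an optional third-party parser; it must be installed separately.
    soup = BeautifulSoup(html, "lxml")
    parser_used = "lxml"
except FeatureNotFound:
    # html.parser ships with Python, so this fallback is always available.
    soup = BeautifulSoup(html, "html.parser")
    parser_used = "html.parser"

print(parser_used, soup.title.string)
```

Different parsers can build different trees from malformed markup, so naming the parser keeps results consistent across machines.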
