BeautifulSoup - Scraping Links from HTML

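When scraping and analyzing content from a website, you often need to extract all the links that a page contains. In this chapter, we will see how to extract links from an HTML document.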

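HTML uses the anchor tag <a> to insert hyperlinks. The href attribute of the anchor tag specifies the target of the link. It uses the following syntax: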

<a href="web page URL">hypertext</a>

With the find_all() method, we can collect all the anchor tags in a document and then print the value of each tag's href attribute.
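
As a minimal offline sketch of this idea, the snippet below parses a small inline document instead of a live page. The sample_html string and its URLs are made up for illustration. It reads the attribute with tag.get("href"), which returns None rather than raising a KeyError when an anchor has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the URLs are placeholders for illustration.
sample_html = """
<p><a href="https://www.example.com/">Home</a></p>
<p><a href="/about">About</a></p>
<p><a name="top">Anchor without href</a></p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect every <a> tag, then read its href attribute.
# tag.get("href") yields None when the attribute is missing.
hrefs = [tag.get("href") for tag in soup.find_all("a")]
print(hrefs)
```

The third anchor has no href, so its entry in the list is None. Such entries can be filtered out before further processing.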

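In the example below, we extract all the links found on the Google home page. We use the requests library to fetch the HTML content of https://google.com, parse it into a soup object, and then collect all the <a> tags. Finally, we print their href attributes.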

Example

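from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)

# Parse the downloaded HTML into a soup object
soup = BeautifulSoup(req.content, "html.parser")

# href=True skips <a> tags without an href attribute, avoiding a KeyError
tags = soup.find_all('a', href=True)
links = [tag['href'] for tag in tags]
for link in links:
    print(link)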

The following is part of the output produced by running the above program:

Output

https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://#/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-IN&authuser=0
https://www.google.com/url?q=https://io.google/2023/%3Futm_source%3Dgoogle-hpp%26utm_medium%3Dembedded_marketing%26utm_campaign%3Dhpp_watch_live%26utm_content%3D&source=hpp&id=19035434&ct=3&usg=AOvVaw0qzqTkP5AEv87NM-MUDd_u&sa=X&ved=0ahUKEwiPzpjku-z-AhU1qJUCHVmqDJoQ8IcBCAU
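
Some entries in this output, such as /preferences?hl=en, are relative links. One possible way to turn them into absolute URLs is urljoin() from the standard library, sketched below. The base_url and hrefs variables are illustrative names holding two values taken from the output above, and the sketch assumes the page was fetched from https://www.google.com/.

```python
from urllib.parse import urljoin

# Assumed base: the URL the page was fetched from.
base_url = "https://www.google.com/"

# Two href values from the output above: one relative, one absolute.
hrefs = ["/preferences?hl=en", "https://news.google.com/?tab=wn"]

# urljoin resolves relative links against the base and
# leaves links that are already absolute unchanged.
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```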

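However, an HTML document may contain hyperlinks with different URL schemes, such as mailto: for links to email addresses, tel: for links to telephone numbers, or file:// for links to local files. If we are only interested in links that use the https:// scheme, we can extract them as shown in the following example. Here we have an HTML document containing different types of hyperlinks, from which only the links with the https:// prefix are extracted.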

from bs4 import BeautifulSoup

html = '''
<p><a href="https://tutorialspoint.com">Web page link </a></p>
<p><a href="https://www.example.com">Web page link </a></p>
<p><a href="mailto:nowhere@mozilla.org">Email link</a></p>
<p><a href="tel:+4733378901">Telephone link</a></p>
'''

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('a', href=True)
links = [tag['href'] for tag in tags]
for link in links:
    # Keep only the links that use the https:// scheme
    if link.startswith("https://"):
        print(link)

Output

https://tutorialspoint.com
https://www.example.com
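
A prefix check works for this document, but an alternative is to parse each href and compare its scheme, which also handles a relative link that merely begins with the letters "https". The sketch below uses urlparse() from the standard library. The has_scheme helper and the https-notes.html entry are hypothetical additions for illustration, not part of the example above.

```python
from urllib.parse import urlparse

# href values from the example above, plus one hypothetical
# relative link whose file name happens to start with "https".
hrefs = [
    "https://tutorialspoint.com",
    "https://www.example.com",
    "mailto:nowhere@mozilla.org",
    "tel:+4733378901",
    "https-notes.html",
]

def has_scheme(href, scheme="https"):
    """Return True if the parsed URL scheme of href equals scheme."""
    return urlparse(href).scheme == scheme

https_links = [href for href in hrefs if has_scheme(href)]
print(https_links)
```

Passing a different scheme, for example has_scheme(href, "mailto"), selects the email links instead.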
