Implementing Web Scraping in Python with BeautifulSoup


BeautifulSoup is a class in Python's bs4 module. Its basic purpose is to parse HTML or XML documents.

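Installing bs4 (BeautifulSoup)

BeautifulSoup is easy to install with pip. The bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4, the distribution that contains the actual library. Run the following command on the command line: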

pip install bs4

After running the command in your terminal, you will see output similar to the following:

C:\Users\rajesh>pip install bs4
Collecting bs4
Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in c:\python\python361\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
Building wheel for bs4 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

To verify that BeautifulSoup was installed successfully on your machine, start Python in the same terminal and run the following import:

C:\Users\rajesh>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>

The import completed without an error, so the installation succeeded.
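
Beyond a bare import, it can help to confirm which version is installed and that parsing works at all. The following is a minimal sketch, assuming the package exposes its version string as bs4.__version__; the one-line HTML snippet is a made-up sample, not part of the tutorial's data.

```python
import bs4
from bs4 import BeautifulSoup

# Report the installed Beautiful Soup version (assumed attribute: __version__).
version = bs4.__version__
print("Beautiful Soup version:", version)

# Parse a tiny, hypothetical document as a smoke test.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(soup.p.string)  # Hello
```

If both lines print without a traceback, the library is importable and the built-in html.parser backend is usable.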

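Example 1

Find all links in an HTML document

Suppose we have an HTML document and want to collect all the links it references. First, we store the document as a string, as shown below: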

html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>
<a href='www.finance.google.com'></a>'''

Next, we create a soup object by passing the html_doc variable to the BeautifulSoup constructor.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

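With the soup object in hand, we can apply the various methods of the BeautifulSoup class to it. For example, we can find every occurrence of a tag in html_doc and read the value of a given attribute on each one.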

for tag in soup.find_all('a'):
    print(tag.get('href'))

The code above loops over every <a> tag in the html_doc string and reads its href attribute, which yields all the links in the document.
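
There is more than one way to read an attribute from a tag, and the choice matters when the attribute may be missing. The sketch below uses a hypothetical two-anchor sample (the second anchor deliberately has no href) to contrast tag.get('href'), tag['href'], and the href=True filter of find_all.

```python
from bs4 import BeautifulSoup

# Two sample anchors: one with an href, one without (hypothetical data).
sample = "<a href='www.python.org'>Python</a><a name='top'>Top</a>"
soup = BeautifulSoup(sample, "html.parser")

first, second = soup.find_all("a")

print(first.get("href"))    # www.python.org
print(first["href"])        # www.python.org
print(dict(first.attrs))    # {'href': 'www.python.org'}
print(second.get("href"))   # None, because the attribute is missing

# second["href"] would raise KeyError, so filter with href=True
# to keep only the tags that actually carry an href.
links = [tag["href"] for tag in soup.find_all("a", href=True)]
print(links)                # ['www.python.org']
```

Using get() is the forgiving option for loops over arbitrary pages; indexing with square brackets is safe only after filtering with href=True.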

Here is the complete code for extracting all the links from an html_doc string.

from bs4 import BeautifulSoup

html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the href attribute of every <a> tag in the document.
for tag in soup.find_all('a'):
    print(tag.get('href'))

Output

www.Tutorialspoint.com
www.nseindia.com.com
www.codesdope.com
www.google.com
www.facebook.com
www.wikipedia.org
www.twitter.com
www.microsoft.com
www.github.com
www.nytimes.com
www.youtube.com
www.reddit.com
www.python.org
www.stackoverflow.com
www.amazon.com
www.rediff.com
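
Printing is fine for a demonstration, but in practice the links are usually kept for further processing. As a rough sketch, the hrefs can be gathered into a list and de-duplicated while preserving document order. The shortened html_doc here is a hypothetical sample with one repeated link, and unique_hrefs is an illustrative name.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical version of html_doc with one repeated link.
html_doc = """<a href='www.google.com'></a>
<a href='www.python.org'></a>
<a href='www.google.com'></a>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Gather every href into a list instead of printing it.
hrefs = [tag.get("href") for tag in soup.find_all("a")]

# dict.fromkeys keeps the first occurrence of each link, in document order.
unique_hrefs = list(dict.fromkeys(hrefs))

print(hrefs)         # ['www.google.com', 'www.python.org', 'www.google.com']
print(unique_hrefs)  # ['www.google.com', 'www.python.org']
```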

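Example 2

Print all links on a website that contain a specific string (for example: python).

The program below prints every URL found on a given website whose link contains "python".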

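from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the page and read the raw HTML.
html = urlopen("https://pythonlang.cn")
content = html.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(content, 'html.parser')

# Only consider <a> tags that have an href attribute.
for a in soup.find_all('a', href=True):
    # Keep links whose href contains the substring "python".
    if re.search('python', a['href']):
        print("Python URL:", a['href'])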

Output

Python URL: https://docs.pythonlang.cn
Python URL: https://pypi.python.org/
Python URL: https://#/pythonlang?fref=ts
Python URL: http://brochure.getpython.info/
Python URL: https://docs.pythonlang.cn/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.pythonlang.cn/
Python URL: https://docs.pythonlang.cn/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: https://pythonlang.cn/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://pythonlang.cn/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: //docs.pythonlang.cn/3/tutorial/controlflow.html#defining-functions
Python URL: //docs.pythonlang.cn/3/tutorial/introduction.html#lists
Python URL: https://docs.pythonlang.cn/3/tutorial/introduction.html#using-python-as-a-calculator
Python URL: //docs.pythonlang.cn/3/tutorial/
Python URL: //docs.pythonlang.cn/3/tutorial/controlflow.html
Python URL: /downloads/release/python-373/
Python URL: https://docs.pythonlang.cn
Python URL: //jobs.python.org
Python URL: http://blog.python.org
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/Joo0vg55HKo/python-373-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/N5tvkDIQ47g/python-3410-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/n0mOibtx6_A/python-3.html
Python URL: /events/python-events/805/
Python URL: /events/python-events/817/
Python URL: /events/python-user-group/814/
Python URL: /events/python-events/789/
Python URL: /events/python-events/831/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: http://wiki.python.org/moin/TkInter
Python URL: http://www.wxpython.org/
Python URL: https://ipython.pythonlang.cn
Python URL: #python-network
Python URL: http://brochure.getpython.info/
Python URL: https://docs.pythonlang.cn/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.pythonlang.cn/
Python URL: https://docs.pythonlang.cn/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: https://pythonlang.cn/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://pythonlang.cn/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: https://devguide.pythonlang.cn/
Python URL: https://bugs.python.org/
Python URL: https://mail.python.org/mailman/listinfo/python-dev
Python URL: #python-network
Python URL: https://github.com/python/pythondotorg/issues
Python URL: https://status.python.org/
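
The list above mixes absolute links, protocol-relative links (starting with //), and site-relative links (starting with /). One possible refinement, sketched here offline so that it runs without network access, is to let find_all do the regular-expression filtering and then resolve every match against a base URL with urllib.parse.urljoin. The base_url value and the markup are hypothetical stand-ins chosen to mirror those link shapes.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, chosen to mirror the link shapes above.
base_url = "https://example.com/"
content = """
<a href="/events/python-events">Events</a>
<a href="//docs.example.com/python/tutorial/">Tutorial</a>
<a href="https://wiki.python.org/moin/">Wiki</a>
<a href="/about/">About</a>
<a>No href here</a>
"""

soup = BeautifulSoup(content, "html.parser")

# A compiled pattern passed as href= is searched in each attribute value;
# tags without an href never match.
pattern = re.compile("python")
matches = [a["href"] for a in soup.find_all("a", href=pattern)]

# urljoin resolves relative and protocol-relative links against base_url
# and leaves links that are already absolute unchanged.
absolute = [urljoin(base_url, href) for href in matches]

for url in absolute:
    print("Python URL:", url)
```

When scraping a live page, the base URL would normally be the address that was passed to urlopen.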

Updated on: 30 July 2019
