Python Tools for Web Scraping


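In computer science, web scraping is the process of extracting data from websites. With this technique, unstructured data on the web can be converted into structured data.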
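To make "unstructured to structured" concrete, here is a minimal sketch that turns a fragment of HTML into a list of records. It uses only the standard library's html.parser; the class name LinkCollector and the sample HTML are made up for illustration and stand in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects every <a> tag as a {'text': ..., 'href': ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []      # structured result
        self._current = None   # record for the <a> tag being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"text": "", "href": dict(attrs).get("href")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


# Hypothetical page fragment, used instead of a live download
SAMPLE_HTML = """
<ul>
  <li><a href="/python/index.htm">Python</a></li>
  <li><a href="/java8/index.htm">Java 8</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for record in collector.records:
    print(record)
```

Each link ends up as a dictionary with a text and an href field, which is the kind of structured form that can be written to a CSV file or a database.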

The most commonly used web scraping tools in Python 3 are:

  • Urllib (urllib2 in Python 2)
  • Requests
  • BeautifulSoup
  • Lxml
  • Selenium
  • MechanicalSoup

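**Urllib** - This package ships with Python, so there is nothing to install. In Python 3 its urllib.request module replaces the old urllib2 module and is used to fetch URLs. The urlopen() function retrieves a URL over different protocols (FTP, HTTP, and others).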

Example Code

from urllib.request import urlopen
my_html = urlopen("https://tutorialspoint.com/")
print(my_html.read())

Output

b'<!DOCTYPE html>\r\n
<!--[if IE 8]>
<html class="ie ie8">
<![endif]-->
\r\n<!--[if IE 9]>
<html class="ie ie9">
<![endif]-->\r\n<!--[if gt IE 9]><!-->
\r\n<html lang="en-US">
<!--<![endif]-->
\r\n<head>\r\n<!-- Basic -->
\r\n<meta charset="utf-8">
\r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title>
\r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/>
\r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n
<meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n
<link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n
<script src="/questions/js/jquery.min.js">
</script>
\r\n<script src="/questions/js/fontawesome.js">
</script>\r\n
<script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js">
</related-placeholder-removed>
</script>\r\n
</head>\r\n
<body>\r\n
<!-- Start of Body Content --> \r\n
<div class="mui-appbar-home">\r\n
<div class="mui-container">\r\n
<div class="tp-primary-header mui-top-home">\r\n
<a href="https://tutorialspoint.com/index.htm" target="_blank" title="TutorialsPoint - Home">
<i class="fa fa-home">
</i><span>Home</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-qa">\r\n
<a href="https://tutorialspoint.com/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i>
<span>
Q/A</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tools">\r\n
<a href="https://tutorialspoint.com/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools">
<i class="fa fa-cogs"></i><span>Tools</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-coding-ground">\r\n
<a href="https://tutorialspoint.com/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal">
<i class="fa fa-code">
</i>
<span>
Coding Ground </span>
</a> \r\n
</div>\r\n
<div class="tp-primary-header mui-top-current-affairs">\r\n
<a href="https://tutorialspoint.com/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe">
</i><span>Current Affairs</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-upsc">\r\n
<a href="https://tutorialspoint.com/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tutors">\r\n
<a href="https://tutorialspoint.com/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect">
<i class="fa fa-user">
</i>
<span>Online Tutors</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-examples">\r\n
….
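The output above starts with b' because read() returns bytes, not text. The sketch below shows one way to decode it, and it also illustrates that urlopen() is not limited to HTTP: it opens a data: URL, which carries its own payload and so needs no network access. The sample URL and its HTML are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption, not a rule.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL embeds the document itself; the HTML here is a made-up sample
sample_url = "data:text/html;charset=utf-8," + quote("<title>Hello</title>")

with urlopen(sample_url) as response:
    raw_bytes = response.read()  # bytes, like the output above
    # Charset declared in the Content-Type header, with an assumed fallback
    charset = response.headers.get_content_charset() or "utf-8"

html_text = raw_bytes.decode(charset)  # str, ready to hand to a parser
print(type(raw_bytes).__name__, type(html_text).__name__)
print(html_text)
```

The same read-then-decode pattern applies to a response fetched over HTTP.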

**Requests** - This module sends HTTP/1.1 requests. It is not preinstalled; install it by typing the following command at the command prompt:

pip install requests

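Example

import requests
# get URL
my_req = requests.get('https://tutorialspoint.com/')
print(my_req.encoding)       # character encoding of the response
print(my_req.status_code)    # HTTP status code
print(my_req.elapsed)        # time between sending the request and getting the response
print(my_req.url)            # final URL after any redirects
print(my_req.history)        # redirect responses, empty if there were none
print(my_req.headers['Content-Type'])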

Output

UTF-8
200
0:00:00.205727
https://tutorialspoint.com/
[]
text/html; charset=UTF-8
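In the output above, the encoding and the Content-Type header agree. That is because Requests takes the response encoding from the charset parameter of that header when one is present. The sketch below is not how Requests works internally; it uses the standard library's email.message.Message to split the same header value into its parts, which helps when you only have the raw header string.

```python
from email.message import Message

# Header value copied from the output above
content_type_header = "text/html; charset=UTF-8"

message = Message()
message["Content-Type"] = content_type_header

media_type = message.get_content_type()   # type/subtype, lower-cased
charset = message.get_content_charset()   # charset parameter, lower-cased

print(media_type)
print(charset)
```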

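**BeautifulSoup** - This is a parsing library that works on top of different parsers. It supports html.parser from Python's standard library out of the box and can also use third-party parsers such as lxml and html5lib. It builds a parse tree that is used to extract data from an HTML page.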

To install this module, type the following command at the command prompt:

pip install beautifulsoup4

Example

from bs4 import BeautifulSoup
# importing requests
import requests
# get URL
my_req = requests.get("https://tutorialspoint.com/")
my_data = my_req.text
# name the parser explicitly; html.parser ships with Python
my_soup = BeautifulSoup(my_data, "html.parser")
# print the href attribute of every <a> tag
for my_link in my_soup.find_all('a'):
    print(my_link.get('href'))

Output

https://tutorialspoint.com/index.htm
https://tutorialspoint.com/questions/index.php
https://tutorialspoint.com/online_dev_tools.htm
https://tutorialspoint.com/codingground.htm
https://tutorialspoint.com/current_affairs/index.htm
https://tutorialspoint.com/upsc_ias_exams.htm
https://tutorialspoint.com/tutor_connect/index.php
https://tutorialspoint.com/programming_examples/
https://tutorialspoint.com/whiteboard.htm
https://tutorialspoint.com/netmeeting.php
https://tutorialspoint.com/articles/
https://tutorialspoint.com/index.htm
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/videotutorials/index.htm
https://store.tutorialspoint.com
https://tutorialspoint.com/html_online_training/index.asp
https://tutorialspoint.com/css_online_training/index.asp
https://tutorialspoint.com/3d_animation_online_training/index.asp
https://tutorialspoint.com/swift_4_online_training/index.asp
https://tutorialspoint.com/blockchain_online_training/index.asp
https://tutorialspoint.com/reactjs_online_training/index.asp
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/computer_fundamentals/index.htm
https://tutorialspoint.com/compiler_design/index.htm
https://tutorialspoint.com/operating_system/index.htm
https://tutorialspoint.com/data_structures_algorithms/index.htm
https://tutorialspoint.com/dbms/index.htm
https://tutorialspoint.com/data_communication_computer_network/index.htm
https://tutorialspoint.com/academic_tutorials.htm
https://tutorialspoint.com/html/index.htm
https://tutorialspoint.com/css/index.htm
https://tutorialspoint.com/javascript/index.htm
https://tutorialspoint.com/php/index.htm
https://tutorialspoint.com/angular4/index.htm
https://tutorialspoint.com/mysql/index.htm
https://tutorialspoint.com/web_development_tutorials.htm
https://tutorialspoint.com/cprogramming/index.htm
https://tutorialspoint.com/cplusplus/index.htm
https://tutorialspoint.com/java8/index.htm
https://tutorialspoint.com/python/index.htm
https://tutorialspoint.com/scala/index.htm
https://tutorialspoint.com/csharp/index.htm
https://tutorialspoint.com/computer_programming_tutorials.htm
https://tutorialspoint.com/java8/index.htm
https://tutorialspoint.com/jdbc/index.htm
https://tutorialspoint.com/servlets/index.htm
https://tutorialspoint.com/spring/index.htm
https://tutorialspoint.com/hibernate/index.htm
https://tutorialspoint.com/swing/index.htm
https://tutorialspoint.com/java_technology_tutorials.htm
https://tutorialspoint.com/android/index.htm
https://tutorialspoint.com/swift/index.htm
https://tutorialspoint.com/ios/index.htm
https://tutorialspoint.com/kotlin/index.htm
https://tutorialspoint.com/react_native/index.htm
https://tutorialspoint.com/xamarin/index.htm
https://tutorialspoint.com/mobile_development_tutorials.htm
https://tutorialspoint.com/mongodb/index.htm
https://tutorialspoint.com/plsql/index.htm
https://tutorialspoint.com/sql/index.htm
https://tutorialspoint.com/db2/index.htm
https://tutorialspoint.com/mysql/index.htm
https://tutorialspoint.com/memcached/index.htm
https://tutorialspoint.com/database_tutorials.htm
https://tutorialspoint.com/asp.net/index.htm
https://tutorialspoint.com/entity_framework/index.htm
https://tutorialspoint.com/vb.net/index.htm
https://tutorialspoint.com/ms_project/index.htm
https://tutorialspoint.com/excel/index.htm
https://tutorialspoint.com/word/index.htm
https://tutorialspoint.com/microsoft_technologies_tutorials.htm
https://tutorialspoint.com/big_data_analytics/index.htm
https://tutorialspoint.com/hadoop/index.htm
https://tutorialspoint.com/sas/index.htm
https://tutorialspoint.com/qlikview/index.htm
https://tutorialspoint.com/power_bi/index.htm
https://tutorialspoint.com/tableau/index.htm
https://tutorialspoint.com/big_data_tutorials.htm
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/codingground.htm
https://tutorialspoint.com/coding_platform_for_websites.htm
https://tutorialspoint.com/developers_best_practices/index.htm
https://tutorialspoint.com/effective_resume_writing.htm
https://tutorialspoint.com/computer_glossary.htm
https://tutorialspoint.com/computer_whoiswho.htm
https://tutorialspoint.com/questions_and_answers.htm
https://tutorialspoint.com/multi_language_tutorials.htm
https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8
https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer
http://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09
/about/index.htm
/about/about_team.htm
/about/about_careers.htm
/about/about_privacy.htm
/about/about_terms_of_use.htm
https://tutorialspoint.com/articles/
https://tutorialspoint.com/online_dev_tools.htm
https://tutorialspoint.com/free_web_graphics.htm
https://tutorialspoint.com/online_file_conversion.htm
https://tutorialspoint.com/shared-tutorials.php
https://tutorialspoint.com/netmeeting.php
https://tutorialspoint.com/free_online_whiteboard.htm
https://tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
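The list above mixes absolute URLs with relative ones such as /about/index.htm, which cannot be fetched as they stand. A minimal sketch of resolving them against the page address with urljoin from the standard library follows; the base URL is the one requested in the example, and the hrefs list is a small hand-picked sample of the output.

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from
base_url = "https://tutorialspoint.com/"

# A few href values copied from the output above
hrefs = [
    "https://tutorialspoint.com/index.htm",
    "/about/index.htm",
    "/about/about_privacy.htm#cookies",
    "https://store.tutorialspoint.com",
]

# urljoin leaves absolute URLs alone and resolves relative ones against base_url
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```

In the BeautifulSoup loop, the same call could wrap my_link.get('href') once links without an href attribute (for which get() returns None) have been skipped.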

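**Lxml** - This is a high-performance, production-quality HTML and XML parsing library. Use it when you need both high quality and maximum speed. It provides several modules for extracting data from websites, including lxml.etree and lxml.html.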

To install it, type the following at the command prompt:

pip install lxml

Example

from lxml import etree
my_root_elem = etree.Element('html')
etree.SubElement(my_root_elem, 'head')
etree.SubElement(my_root_elem, 'title')
etree.SubElement(my_root_elem, 'body')
print(etree.tostring(my_root_elem, pretty_print = True).decode("utf-8"))

Output

<html>
  <head/>
  <title/>
  <body/>
</html>

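**Selenium** - This is a browser automation tool, also known as a web driver. On some websites the content only appears after an interaction, such as clicking a button or scrolling the page, and you have to wait for it to load. Selenium is the tool for those cases.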

Install Selenium with the following command:

pip install selenium

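Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# path to the ChromeDriver binary on this machine
my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
# Selenium 4 takes the driver path through a Service object;
# the older executable_path argument is no longer accepted in recent releases
my_browser = webdriver.Chrome(service=Service(my_path_to_chromedriver))
my_url = 'https://tutorialspoint.com/'
my_browser.get(my_url)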

Output

tutorialspoint

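**MechanicalSoup** - This is another Python library for automating interaction with websites. It stores and sends cookies automatically, follows redirects, and can follow links and submit forms. It does not execute JavaScript.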

To install it, use the following command:

pip install MechanicalSoup

Example

import mechanicalsoup
my_browser = mechanicalsoup.StatefulBrowser()
my_value = my_browser.open("https://tutorialspoint.com/")
print(my_value)
my_val = my_browser.get_url()
print(my_val)
my_va = my_browser.follow_link("forms")
print(my_va)
my_value1 = my_browser.get_url()
print(my_value1)

Updated on: June 26, 2020
