Python Web Scraping Tools
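In computer science, web scraping is the process of extracting data from websites. With this technique, unstructured data on the web can be converted into structured data.
The most commonly used web scraping tools in Python 3 include: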
- urllib (`urllib.request` in Python 3; known as `urllib2` in Python 2)
- Requests
- BeautifulSoup
- Lxml
- Selenium
- MechanicalSoup
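**urllib** - This module ships with Python, so nothing needs to be installed. In Python 2 it was called `urllib2`; in Python 3 the same functionality lives in `urllib.request`. The module is used to fetch URLs: the urlopen() function retrieves a URL over different protocols (FTP, HTTP, and so on).
Example code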
    from urllib.request import urlopen

    # Fetch the page and print the raw response body (bytes)
    my_html = urlopen("https://tutorialspoint.com/")
    print(my_html.read())
Output
    b'<!DOCTYPE html>\r\n <!--[if IE 8]> <html class="ie ie8"> <![endif]--> \r\n<!--[if IE 9]> <html class="ie ie9"> <![endif]-->\r\n<!--[if gt IE 9]><!--> \r\n<html lang="en-US"> <!--<![endif]--> \r\n<head>\r\n<!-- Basic --> \r\n<meta charset="utf-8"> \r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title> \r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/> \r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n <link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n <script src="/questions/js/jquery.min.js"> </script> \r\n<script src="/questions/js/fontawesome.js"> </script>\r\n <script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js"> </script>\r\n </head>\r\n <body>\r\n <!-- Start of Body Content --> \r\n <div class="mui-appbar-home">\r\n <div class="mui-container">\r\n <div class="tp-primary-header mui-top-home">\r\n <a href="https://tutorialspoint.com/index.htm" target="_blank" title="TutorialsPoint - Home"> <i class="fa fa-home"> </i><span>Home</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-qa">\r\n <a href="https://tutorialspoint.com/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i> <span> Q/A</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tools">\r\n <a href="https://tutorialspoint.com/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools"> <i class="fa fa-cogs"></i><span>Tools</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-coding-ground">\r\n <a href="https://tutorialspoint.com/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal"> <i class="fa fa-code"> </i> <span> Coding Ground </span> </a> \r\n </div>\r\n <div class="tp-primary-header mui-top-current-affairs">\r\n <a href="https://tutorialspoint.com/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe"> </i><span>Current Affairs</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-upsc">\r\n <a href="https://tutorialspoint.com/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tutors">\r\n <a href="https://tutorialspoint.com/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect"> <i class="fa fa-user"> </i> <span>Online Tutors</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-examples">\r\n ….
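Because urlopen() returns raw bytes, a scraper usually decodes the body before working with it. The following is a minimal sketch of one way to do that, not the only approach: the helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration. The demo uses a `data:` URL so that it runs without network access; an `http://` or `https://` URL goes through the same function.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Open a URL and return its body decoded to str.

    Hypothetical helper: the name, timeout, and UTF-8 fallback are
    illustrative choices, not part of urllib itself.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising an exception.
    return raw.decode(charset, errors="replace")


# A data: URL keeps the demo offline; the body is "<h1>Hello</h1>".
demo = fetch_text("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E")
print(demo)
```

Calling `fetch_text("https://tutorialspoint.com/")` would return the page as a string rather than the `b'...'` bytes shown in the output above.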
**Requests** - This module is not preinstalled; install it by entering the command below at the command prompt. **Requests** sends HTTP/1.1 requests.

    pip install requests

Example
    import requests

    # Get the URL
    my_req = requests.get('https://tutorialspoint.com/')
    print(my_req.encoding)
    print(my_req.status_code)
    print(my_req.elapsed)
    print(my_req.url)
    print(my_req.history)
    print(my_req.headers['Content-Type'])
Output

    UTF-8
    200
    0:00:00.205727
    https://tutorialspoint.com/
    []
    text/html; charset=UTF-8
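In practice a scraper often adds query parameters, a timeout, and a status check to the basic `requests.get()` call. The sketch below shows one possible arrangement under stated assumptions: the helper names `preview_url` and `fetch_html`, the `q` query parameter, and the 10-second timeout are all hypothetical. Only `preview_url` runs in the demo, and it builds the request without sending it.

```python
import requests


def preview_url(url, params=None):
    """Build, but do not send, a GET request and return its final URL."""
    return requests.Request("GET", url, params=params).prepare().url


def fetch_html(url, params=None, timeout=10):
    """Send a GET request and return the body as text.

    Hypothetical helper: raises requests.HTTPError on 4xx/5xx responses
    and gives up if the server does not answer within `timeout` seconds.
    """
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


# "q" is an illustrative parameter name, not a documented site API.
final_url = preview_url("https://tutorialspoint.com/", {"q": "python"})
print(final_url)
```

Checking the prepared URL first is a cheap way to confirm that parameters are encoded as intended before any traffic is sent.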
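**BeautifulSoup** - This is a parsing library that works on top of different parsers. Python's standard library includes one of them, `html.parser`, so BeautifulSoup can be used without installing a separate parser. It builds a parse tree that is used to extract data from HTML pages.
To install this module, enter the following command at the command prompt:

    pip install beautifulsoup4

Example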
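    from bs4 import BeautifulSoup
    import requests

    # Get the URL and read the response body as text
    my_req = requests.get("https://tutorialspoint.com/")
    my_data = my_req.text

    # Name the parser explicitly to avoid a warning and get consistent results
    my_soup = BeautifulSoup(my_data, "html.parser")

    # Print the href of every <a> tag on the page
    for my_link in my_soup.find_all('a'):
        print(my_link.get('href'))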
Output
    https://tutorialspoint.com/index.htm
    https://tutorialspoint.com/questions/index.php
    https://tutorialspoint.com/online_dev_tools.htm
    https://tutorialspoint.com/codingground.htm
    https://tutorialspoint.com/current_affairs/index.htm
    https://tutorialspoint.com/upsc_ias_exams.htm
    https://tutorialspoint.com/tutor_connect/index.php
    https://tutorialspoint.com/programming_examples/
    https://tutorialspoint.com/whiteboard.htm
    https://tutorialspoint.com/netmeeting.php
    https://tutorialspoint.com/articles/
    https://tutorialspoint.com/index.htm
    https://tutorialspoint.com/tutorialslibrary.htm
    https://tutorialspoint.com/videotutorials/index.htm
    https://store.tutorialspoint.com
    https://tutorialspoint.com/html_online_training/index.asp
    https://tutorialspoint.com/css_online_training/index.asp
    https://tutorialspoint.com/3d_animation_online_training/index.asp
    https://tutorialspoint.com/swift_4_online_training/index.asp
    https://tutorialspoint.com/blockchain_online_training/index.asp
    https://tutorialspoint.com/reactjs_online_training/index.asp
    https://tutorialspoint.com/tutorialslibrary.htm
    https://tutorialspoint.com/computer_fundamentals/index.htm
    https://tutorialspoint.com/compiler_design/index.htm
    https://tutorialspoint.com/operating_system/index.htm
    https://tutorialspoint.com/data_structures_algorithms/index.htm
    https://tutorialspoint.com/dbms/index.htm
    https://tutorialspoint.com/data_communication_computer_network/index.htm
    https://tutorialspoint.com/academic_tutorials.htm
    https://tutorialspoint.com/html/index.htm
    https://tutorialspoint.com/css/index.htm
    https://tutorialspoint.com/javascript/index.htm
    https://tutorialspoint.com/php/index.htm
    https://tutorialspoint.com/angular4/index.htm
    https://tutorialspoint.com/mysql/index.htm
    https://tutorialspoint.com/web_development_tutorials.htm
    https://tutorialspoint.com/cprogramming/index.htm
    https://tutorialspoint.com/cplusplus/index.htm
    https://tutorialspoint.com/java8/index.htm
    https://tutorialspoint.com/python/index.htm
    https://tutorialspoint.com/scala/index.htm
    https://tutorialspoint.com/csharp/index.htm
    https://tutorialspoint.com/computer_programming_tutorials.htm
    https://tutorialspoint.com/java8/index.htm
    https://tutorialspoint.com/jdbc/index.htm
    https://tutorialspoint.com/servlets/index.htm
    https://tutorialspoint.com/spring/index.htm
    https://tutorialspoint.com/hibernate/index.htm
    https://tutorialspoint.com/swing/index.htm
    https://tutorialspoint.com/java_technology_tutorials.htm
    https://tutorialspoint.com/android/index.htm
    https://tutorialspoint.com/swift/index.htm
    https://tutorialspoint.com/ios/index.htm
    https://tutorialspoint.com/kotlin/index.htm
    https://tutorialspoint.com/react_native/index.htm
    https://tutorialspoint.com/xamarin/index.htm
    https://tutorialspoint.com/mobile_development_tutorials.htm
    https://tutorialspoint.com/mongodb/index.htm
    https://tutorialspoint.com/plsql/index.htm
    https://tutorialspoint.com/sql/index.htm
    https://tutorialspoint.com/db2/index.htm
    https://tutorialspoint.com/mysql/index.htm
    https://tutorialspoint.com/memcached/index.htm
    https://tutorialspoint.com/database_tutorials.htm
    https://tutorialspoint.com/asp.net/index.htm
    https://tutorialspoint.com/entity_framework/index.htm
    https://tutorialspoint.com/vb.net/index.htm
    https://tutorialspoint.com/ms_project/index.htm
    https://tutorialspoint.com/excel/index.htm
    https://tutorialspoint.com/word/index.htm
    https://tutorialspoint.com/microsoft_technologies_tutorials.htm
    https://tutorialspoint.com/big_data_analytics/index.htm
    https://tutorialspoint.com/hadoop/index.htm
    https://tutorialspoint.com/sas/index.htm
    https://tutorialspoint.com/qlikview/index.htm
    https://tutorialspoint.com/power_bi/index.htm
    https://tutorialspoint.com/tableau/index.htm
    https://tutorialspoint.com/big_data_tutorials.htm
    https://tutorialspoint.com/tutorialslibrary.htm
    https://tutorialspoint.com/codingground.htm
    https://tutorialspoint.com/coding_platform_for_websites.htm
    https://tutorialspoint.com/developers_best_practices/index.htm
    https://tutorialspoint.com/effective_resume_writing.htm
    https://tutorialspoint.com/computer_glossary.htm
    https://tutorialspoint.com/computer_whoiswho.htm
    https://tutorialspoint.com/questions_and_answers.htm
    https://tutorialspoint.com/multi_language_tutorials.htm
    https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8
    https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer
    http://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09
    /about/index.htm
    /about/about_team.htm
    /about/about_careers.htm
    /about/about_privacy.htm
    /about/about_terms_of_use.htm
    https://tutorialspoint.com/articles/
    https://tutorialspoint.com/online_dev_tools.htm
    https://tutorialspoint.com/free_web_graphics.htm
    https://tutorialspoint.com/online_file_conversion.htm
    https://tutorialspoint.com/shared-tutorials.php
    https://tutorialspoint.com/netmeeting.php
    https://tutorialspoint.com/free_online_whiteboard.htm
    https://tutorialspoint.com
    https://www.facebook.com/tutorialspointindia
    https://plus.google.com/u/0/+tutorialspoint
    http://www.twitter.com/tutorialspoint
    http://www.linkedin.com/company/tutorialspoint
    https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
    https://tutorialspoint.com/index.htm
    /about/about_privacy.htm#cookies
    /about/faq.htm
    /about/about_helping.htm
    /about/contact_us.htm
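The output mixes absolute URLs with relative ones such as `/about/index.htm`. A common follow-up step is to resolve every link against the page URL so the list is uniform. Here is a minimal sketch of that idea; the helper name `extract_links` and the inline sample HTML are made up for illustration, and the sample stands in for a downloaded page so the code runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="https://tutorialspoint.com/python/index.htm">Python</a>
  <a href="/about/index.htm">About</a>
  <a>No href here</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return every link in `html` as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute.
    for anchor in soup.find_all("a", href=True):
        links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_links(SAMPLE_HTML, "https://tutorialspoint.com/")
for link in links:
    print(link)
```

Filtering with `href=True` also avoids the `None` values that `my_link.get('href')` would print for anchors without a target.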
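**Lxml** - This is a high-performance, production-grade library for parsing HTML and XML. It is the one to use when both quality and maximum speed matter. It provides many modules for extracting data from websites.
To install it, enter the following at the command prompt:

    pip install lxml

Example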
    from lxml import etree

    # Build a small element tree: <html> with three empty children
    my_root_elem = etree.Element('html')
    etree.SubElement(my_root_elem, 'head')
    etree.SubElement(my_root_elem, 'title')
    etree.SubElement(my_root_elem, 'body')

    # Serialize the tree with indentation and print it as text
    print(etree.tostring(my_root_elem, pretty_print=True).decode("utf-8"))
Output

    <html>
      <head/>
      <title/>
      <body/>
    </html>
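The example above builds a tree, whereas scraping usually goes the other way: parse existing HTML and query it. As a rough sketch of that direction, the snippet below parses an inline sample document with `lxml.html` and pulls values out with XPath. The sample markup and the `item` class name are invented for the demo.

```python
from lxml import html

# Hypothetical sample standing in for a downloaded page.
SAMPLE = (
    "<html><head><title>Demo page</title></head>"
    "<body><ul>"
    "<li class='item'>one</li>"
    "<li class='item'>two</li>"
    "</ul></body></html>"
)

# Parse the string into an element tree.
tree = html.fromstring(SAMPLE)

# XPath text() queries return lists of strings.
title = tree.xpath("//title/text()")[0]
items = tree.xpath("//li[@class='item']/text()")

print(title)
print(items)
```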
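**Selenium** - This is a browser automation tool, also known as Selenium WebDriver. On some websites the content only appears after an interaction such as clicking a button or scrolling the page, and the script has to wait for it. Selenium is the tool for those cases, because it drives a real browser.
To install Selenium, use the following command:

    pip install selenium

Example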
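    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'

    # Selenium 4 takes the driver path through a Service object;
    # the older executable_path argument has been removed
    my_service = Service(executable_path=my_path_to_chromedriver)
    my_browser = webdriver.Chrome(service=my_service)

    # Open the page in the automated browser
    my_url = 'https://tutorialspoint.com/'
    my_browser.get(my_url)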
Output
This script prints nothing to the console; it opens a Chrome window and loads the page.
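**MechanicalSoup** - This is another Python library for automating interaction with websites. It stores and sends cookies automatically, follows redirects, and can follow links and submit forms. It does not execute JavaScript.
To install it, use the following command:

    pip install MechanicalSoup

Example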
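    import mechanicalsoup

    my_browser = mechanicalsoup.StatefulBrowser()

    # Open the page and print the response object
    my_value = my_browser.open("https://tutorialspoint.com/")
    print(my_value)

    # Print the URL of the current page
    my_val = my_browser.get_url()
    print(my_val)

    # Follow the first link whose URL matches the pattern "forms"
    my_va = my_browser.follow_link("forms")
    print(my_va)

    # Print the URL reached by following the link
    my_value1 = my_browser.get_url()
    print(my_value1)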