Difference Between BeautifulSoup and Scrapy Crawler
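Beautiful Soup and Scrapy are both used for web scraping in Python. Their use cases overlap, but they differ in what they do: Beautiful Soup parses documents, while Scrapy is a full crawling framework. Web scraping is useful for data collection and analysis in fields such as research, marketing, and business intelligence. This article looks at the differences between Beautiful Soup and Scrapy and at how each is used for web scraping.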
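| Feature | Beautiful Soup | Scrapy |
|---|---|---|
| Parsing | Parses HTML and XML documents. | Combines crawling and parsing to extract data from websites. |
| Ease of use | A small library that is easy to pick up. | A larger framework that takes more programming experience to use well. |
| Concurrency | Only parses documents and does not fetch pages, so it offers no concurrency of its own. A simple script typically processes one page at a time. | Issues requests concurrently, so many pages can be scraped at once, which makes it faster and more efficient for large scraping projects. |
| Middleware | Provides no middleware system. | Provides a middleware system that lets developers customize the spider's behavior at different stages of the crawl. |
| Data storage | No built-in storage support; the developer handles data storage by hand. | Built-in feed exports write data in formats such as CSV, JSON, and XML; databases such as MySQL and MongoDB can be integrated through item pipelines. |
| Robustness | Less robust and fault tolerant than Scrapy, because error handling is left to the calling code. | More robust, with built-in error handling such as retrying failed requests, enforcing timeouts, and filtering out error responses such as 404 and 403. |
| Community | Has a community, though with a smaller and less active contributor base than Scrapy's. | Has a large, active community of developers and contributors who keep improving and updating the framework. |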
Beautiful Soup
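Beautiful Soup is an open-source Python library for parsing HTML and XML pages. Parsing a page makes it possible to extract data from it. The library offers functions for searching an HTML document for specific tags, links, and other attributes. When the data you need sits on a single page, Beautiful Soup is a good fit.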
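As a minimal sketch of those search functions, the snippet below parses an inline HTML string instead of a downloaded page, so it runs without network access. The markup is made up for the illustration; `find`, `find_all`, and `select` are standard Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html><body>
  <h1 id="title">Quotes</h1>
  <div class="quote"><span class="text">First quote</span>
    <a class="tag" href="/tag/life/">life</a></div>
  <div class="quote"><span class="text">Second quote</span>
    <a class="tag" href="/tag/truth/">truth</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
title = soup.find("h1", id="title").get_text()

# find_all() returns every matching tag; class_ filters by CSS class
texts = [span.get_text() for span in soup.find_all("span", class_="text")]

# select() takes a CSS selector; attributes are read like dictionary keys
hrefs = [a["href"] for a in soup.select("div.quote a.tag")]

print(title)
print(texts)
print(hrefs)
```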
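Example
The example below uses Beautiful Soup together with the requests library to print every link on a web page. First import requests and Beautiful Soup, then send a GET request to the page's URL and parse the returned HTML with Beautiful Soup. Once the page is parsed, Beautiful Soup's search methods can find all the links on it.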
import requests
from bs4 import BeautifulSoup

# Make a request to the webpage
url = 'https://tutorialspoint.com/index.htm'
response = requests.get(url, timeout=10)
# Stop early if the server returned an error status
response.raise_for_status()

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print the href of each link (None when an <a> tag has no href)
for link in links:
    print(link.get('href'))
Output
https://tutorialspoint.com/index.htm
https://tutorialspoint.com/codingground.htm
https://tutorialspoint.com/about/about_careers.htm
https://tutorialspoint.com/whiteboard.htm
https://tutorialspoint.com/online_dev_tools.htm
https://tutorialspoint.com/business/index.asp
https://tutorialspoint.com/market/teach_with_us.jsp
https://#/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.linkedin.com/authwall?trk=bf&trkInfo=AQEkqX2eckF__gAAAX-wMwEYvrsjBVbEtWQd4pgEdVSzkL22Nik1KEpY_ECWLKDGc41z8IOZWr2Bb0fvJplT60NPBtSw87J6QCpc7wD4qQ3iU13n6xJtBxME5o05Wmpg5JPm5YY=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Ftutorialspoint
index.htm
None
https://tutorialspoint.com/categories/development
https://tutorialspoint.com/categories/it_and_software
https://tutorialspoint.com/categories/data_science_and_ai_ml
https://tutorialspoint.com/categories/cyber_security
https://tutorialspoint.com/categories/marketing
https://tutorialspoint.com/categories/office_productivity
https://tutorialspoint.com/categories/business
https://tutorialspoint.com/categories/lifestyle
https://tutorialspoint.com/latest/prime-packs
https://tutorialspoint.com/market/index.asp
https://tutorialspoint.com/latest/ebooks
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/articles/index.php
https://tutorialspoint.com/market/login.asp
https://tutorialspoint.com/latest/prime-packs
https://tutorialspoint.com/market/index.asp
https://tutorialspoint.com/latest/ebooks
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/articles/index.php
https://tutorialspoint.com/codingground.htm
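The output mixes absolute URLs with a relative path (`index.htm`) and a `None` for an anchor that has no `href`. One possible way to tidy this up, sketched here with a hypothetical helper named `absolute_links` and made-up sample markup, is to skip anchors without an `href` and resolve the rest against the page URL with the standard library's `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return the absolute URL of every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        links.append(urljoin(base_url, a["href"]))
    return links


# Made-up sample: one relative link, one anchor without href, one absolute link
sample = (
    '<a href="index.htm">Home</a>'
    '<a>No link</a>'
    '<a href="https://example.com/x">X</a>'
)
print(absolute_links(sample, "https://tutorialspoint.com/index.htm"))
```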
Scrapy
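Scrapy is an open-source Python framework for web crawling and web scraping. It is meant for large-scale scraping projects, and it also provides features for extracting, storing, and processing data. When you need to scrape complex data from many pages, Scrapy is a good fit.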
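Example
The example below uses Scrapy to scrape data from several pages of a quotations website. It defines a Scrapy spider that requests the first page of the site, parses the page and extracts data from it, and then follows the next-page link until there are no more pages to scrape.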
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract the text, author, and tags of every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the link to the next page, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


# Create a Scrapy process
process = CrawlerProcess(get_project_settings())
# Schedule the spider
process.crawl(QuotesSpider)
# Run the crawl; this call blocks until the spider finishes.
# Each scraped item is written to the log at DEBUG level.
process.start()
Output
2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://quotes.toscrape.com/page/8/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Anyone who has never made a mistake has never tried anything new.”', 'author': 'Albert Einstein', 'tags': ['mistakes']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen', 'tags': ['humor', 'love', 'romantic', 'women']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”', 'author': 'J.K. Rowling', 'tags': ['integrity']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”', 'author': 'Jane Austen', 'tags': ['books', 'library', 'reading']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”', 'author': 'Jane Austen', 'tags': ['elizabeth-bennet', 'jane-austen']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Some day you will be old enough to start reading fairy tales again.”', 'author': 'C.S. Lewis', 'tags': ['age', 'fairytales', 'growing-up']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”', 'author': 'C.S. Lewis', 'tags': ['god']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”', 'author': 'Mark Twain', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“A lie can travel half way around the world while the truth is putting on its shoes.”', 'author': 'Mark Twain', 'tags': ['misattributed-mark-twain', 'truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”', 'author': 'C.S. Lewis', 'tags': ['christianity', 'faith', 'religion', 'sun']}
2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://quotes.toscrape.com/page/9/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”', 'author': 'J.K. Rowling', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”", 'author': 'Jimi Hendrix', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“To die will be an awfully big adventure.”', 'author': 'J.M. Barrie', 'tags': ['adventure', 'love']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“It takes courage to grow up and become who you really are.”', 'author': 'E.E. Cummings', 'tags': ['courage']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“But better to get hurt by the truth than comforted with a lie.”', 'author': 'Khaled Hosseini', 'tags': ['life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”', 'author': 'Harper Lee', 'tags': ['better-life-empathy']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”', 'author': "Madeleine L'Engle", 'tags': ['books', 'children', 'difficult', 'grown-ups', 'write', 'writers', 'writing']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“Never tell the truth to people who are not worthy of it.”', 'author': 'Mark Twain', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“A person's a person, no matter how small.”", 'author': 'Dr. Seuss', 'tags': ['inspirational']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2023-04-17 00:53:00 [scrapy.core.engine] INFO: Closing spider (finished)
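Conclusion
This article discussed the differences between Beautiful Soup and Scrapy in Python. Both are used for web scraping, but they do different jobs. Beautiful Soup suits scraping data from a single page, while Scrapy suits scraping large amounts of data from many pages.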