Difference Between Beautiful Soup and the Scrapy Crawler


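Beautiful Soup and Scrapy are both used for web scraping in Python. The two tools target the same use case but offer different functionality. Web scraping is useful for collecting and analyzing data in fields such as research, marketing, and business intelligence. This article looks at the differences between Beautiful Soup and the Scrapy crawler and at how each is used for web scraping.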

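| Feature | Beautiful Soup | Scrapy |
| --- | --- | --- |
| Parsing | Parses HTML and XML documents. | Combines parsing with crawling to extract data from websites. |
| Ease of use | An easy-to-use library. | A more complex framework that requires solid programming skills. |
| Concurrency | No concurrency support of its own. It only parses, so pages are typically fetched one at a time by separate code. | Supports concurrency and can fetch multiple pages at the same time, which makes it faster and more efficient for large scraping projects. |
| Middleware | No middleware system. | Provides a middleware system that lets developers customize the spider's behavior at different stages of the scraping process. |
| Data storage | No built-in storage support; developers handle data storage manually. | Built-in export to formats such as CSV, JSON, and XML; databases such as MySQL and MongoDB can be integrated, typically through item pipelines. |
| Robustness | Less robust and fault tolerant than Scrapy. | More robust, with built-in error handling such as retrying failed requests, handling timeouts, and filtering out error responses such as 404 and 403. |
| Community | Has a community, but one that is smaller and less active than Scrapy's. | Has a large, active community of developers and contributors who keep improving and updating the framework. |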
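To make the middleware row more concrete, here is a minimal sketch of what a Scrapy downloader middleware can look like. The class name `DefaultUserAgentMiddleware`, the `example-bot/0.1` string, and the `FakeRequest` stand-in are all hypothetical and exist only for illustration; the `process_request(request, spider)` hook follows the documented downloader middleware interface, but check the Scrapy version you use.

```python
class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware: sets a User-Agent if none is present."""

    def __init__(self, user_agent="example-bot/0.1"):  # assumed value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Only fill in the header when the request does not already carry one
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing this request normally
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self):
        self.headers = {}


# Exercise the hook outside of a real crawl
request = FakeRequest()
result = DefaultUserAgentMiddleware().process_request(request, spider=None)
print(request.headers)
```

In a real project, a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting. Beautiful Soup has no equivalent hook, so comparable behavior has to live in whatever code performs the HTTP requests.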

Beautiful Soup

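Beautiful Soup is an open-source Python library for parsing HTML and XML pages. Parsing an HTML page makes it possible to extract data from it. The library provides functions for searching an HTML document for specific tags, links, and other attributes. If the data you want to scrape sits on a single page, Beautiful Soup is usually the best choice.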
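As a small illustration of these search functions, the sketch below parses an inline HTML string instead of a live page. The markup, ids, and class names are made up for the example; `find`, `find_all`, and `select_one` are standard Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Made-up markup used only for this illustration
html = """
<html><body>
  <h1 id="title">Sample page</h1>
  <p class="intro">First paragraph.</p>
  <a href="/about" class="nav">About</a>
  <a href="/contact" class="nav">Contact</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
title = soup.find("h1", id="title").get_text()

# find_all() returns every matching tag; class_ filters on the CSS class
nav_hrefs = [a["href"] for a in soup.find_all("a", class_="nav")]

# select_one() accepts a CSS selector
intro = soup.select_one("p.intro").get_text()

print(title, nav_hrefs, intro)
```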

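Example

The example below uses Beautiful Soup together with the requests library to print every link found on a web page. First import requests and Beautiful Soup, then send a GET request to the page's URL and parse the returned HTML with Beautiful Soup. Once the page is parsed, the find_all method locates all the links on the page. Note that link.get('href') returns None for an anchor tag that has no href attribute.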

import requests
from bs4 import BeautifulSoup

# Make a request to the webpage
url = 'https://tutorialspoint.com/index.htm'
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print the links
for link in links:
   print(link.get('href'))

Output

https://tutorialspoint.com/index.htm
https://tutorialspoint.com/codingground.htm
https://tutorialspoint.com/about/about_careers.htm
https://tutorialspoint.com/whiteboard.htm
https://tutorialspoint.com/online_dev_tools.htm
https://tutorialspoint.com/business/index.asp
https://tutorialspoint.com/market/teach_with_us.jsp
https://#/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.linkedin.com/authwall?trk=bf&trkInfo=AQEkqX2eckF__gAAAX-wMwEYvrsjBVbEtWQd4pgEdVSzkL22Nik1KEpY_ECWLKDGc41z8IOZWr2Bb0fvJplT60NPBtSw87J6QCpc7wD4qQ3iU13n6xJtBxME5o05Wmpg5JPm5YY=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Ftutorialspoint
index.htm
None
https://tutorialspoint.com/categories/development
https://tutorialspoint.com/categories/it_and_software
https://tutorialspoint.com/categories/data_science_and_ai_ml
https://tutorialspoint.com/categories/cyber_security
https://tutorialspoint.com/categories/marketing
https://tutorialspoint.com/categories/office_productivity
https://tutorialspoint.com/categories/business
https://tutorialspoint.com/categories/lifestyle
https://tutorialspoint.com/latest/prime-packs
https://tutorialspoint.com/market/index.asp
https://tutorialspoint.com/latest/ebooks
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/articles/index.php
https://tutorialspoint.com/market/login.asp
https://tutorialspoint.com/latest/prime-packs
https://tutorialspoint.com/market/index.asp
https://tutorialspoint.com/latest/ebooks
https://tutorialspoint.com/tutorialslibrary.htm
https://tutorialspoint.com/articles/index.php
https://tutorialspoint.com/codingground.htm
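The output above contains a `None` entry, a relative link (`index.htm`), and several duplicates. One possible way to clean such a list is sketched below using only the standard library. The helper name `normalize_links` and the sample list are hypothetical and not part of the original example.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Drop missing hrefs, resolve relative ones, and remove duplicates in order."""
    absolute = [urljoin(base_url, href) for href in hrefs if href]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(absolute))


# Sample values modeled on the kinds of entries seen in the output
sample_hrefs = [
    "index.htm",
    None,
    "https://tutorialspoint.com/codingground.htm",
    "https://tutorialspoint.com/codingground.htm",
]
cleaned = normalize_links("https://tutorialspoint.com/index.htm", sample_hrefs)
print(cleaned)
```

In the earlier script, the same helper could be fed with the `href` values collected from `soup.find_all('a')`.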

Scrapy

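Scrapy is a Python framework for web crawling and web scraping. It is the tool to reach for when scraping data for large-scale projects, and it provides features for extracting, storing, and processing that data. When you need to scrape complex data from many pages, Scrapy is the best choice.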
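To illustrate the storing and processing side, here is a minimal sketch of an item pipeline. It uses SQLite from the standard library as a stand-in for a database such as MySQL or MongoDB; the class name `SQLiteQuotesPipeline`, the `quotes.db` file name, and the table layout are assumptions made for this example. The `open_spider`, `process_item`, and `close_spider` hooks follow the documented item pipeline interface, but check the Scrapy version you use.

```python
import sqlite3


class SQLiteQuotesPipeline:
    """Hypothetical item pipeline that stores scraped quotes in SQLite."""

    def __init__(self, db_path="quotes.db"):  # assumed file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: store it, then pass it along unchanged
        self.conn.execute(
            "INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
            (item["text"], item["author"], ",".join(item["tags"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


# Quick check outside Scrapy, using an in-memory database
pipeline = SQLiteQuotesPipeline(":memory:")
pipeline.open_spider(spider=None)
stored = pipeline.process_item(
    {"text": "Example quote", "author": "Example Author", "tags": ["a", "b"]},
    spider=None,
)
rows = pipeline.conn.execute("SELECT text, author, tags FROM quotes").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

In a Scrapy project, a pipeline like this would be enabled through the `ITEM_PIPELINES` setting, using whatever module path the project defines.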

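Example

The example below uses Scrapy to scrape data from multiple pages of a quotes website. To do this, define a Scrapy spider that sends a request to the first page of the site, parses the page and extracts data from it, and then follows the next-page link until there are no more pages to scrape.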

import scrapy

class QuotesSpider(scrapy.Spider):
   name = "quotes"
   start_urls = [
      'http://quotes.toscrape.com/page/1/',
   ]

   def parse(self, response):
      for quote in response.css('div.quote'):
         yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
         }
      next_page = response.css('li.next a::attr(href)').get()
      if next_page is not None:
         yield response.follow(next_page, self.parse)

# Create a Scrapy process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())

# Start the spider
process.crawl(QuotesSpider)

# Run the spider (blocks until the crawl finishes); scraped items appear in the log output
process.start()

Output

2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/9/> (referer: http://quotes.toscrape.com/page/8/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Anyone who has never made a mistake has never tried anything new.”', 'author': 'Albert Einstein', 'tags': ['mistakes']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen', 'tags': ['humor', 'love', 'romantic', 'women']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”', 'author': 'J.K. Rowling', 'tags': ['integrity']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”', 'author': 'Jane Austen', 'tags': ['books', 'library', 'reading']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”', 'author': 'Jane Austen', 'tags': ['elizabeth-bennet', 'jane-austen']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Some day you will be old enough to start reading fairy tales again.”', 'author': 'C.S. Lewis', 'tags': ['age', 'fairytales', 'growing-up']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”', 'author': 'C.S. Lewis', 'tags': ['god']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”', 'author': 'Mark Twain', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“A lie can travel half way around the world while the truth is putting on its shoes.”', 'author': 'Mark Twain', 'tags': ['misattributed-mark-twain', 'truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”', 'author': 'C.S. Lewis', 'tags': ['christianity', 'faith', 'religion', 'sun']}
2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”', 'author': 'J.K. Rowling', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”", 'author': 'Jimi Hendrix', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“To die will be an awfully big adventure.”', 'author': 'J.M. Barrie', 'tags': ['adventure', 'love']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“It takes courage to grow up and become who you really are.”', 'author': 'E.E. Cummings', 'tags': ['courage']}     
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“But better to get hurt by the truth than comforted with a lie.”', 'author': 'Khaled Hosseini', 'tags': ['life']}  
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”', 'author': 'Harper Lee', 'tags': ['better-life-empathy']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”', 'author': "Madeleine L'Engle", 'tags': ['books', 'children', 'difficult', 'grown-ups', 'write', 'writers', 'writing']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“Never tell the truth to people who are not worthy of it.”', 'author': 'Mark Twain', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“A person's a person, no matter how small.”", 'author': 'Dr. Seuss', 'tags': ['inspirational']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2023-04-17 00:53:00 [scrapy.core.engine] INFO: Closing spider (finished)
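The concurrency, robustness, and data storage behavior described in the comparison table is largely controlled through Scrapy settings. The sketch below collects a few of them in a plain dictionary. The setting names are real Scrapy settings, while the values and the `quotes.json` file name are arbitrary choices for illustration, and the `FEEDS` options shown assume a reasonably recent Scrapy release.

```python
# Illustrative values only; tune them for the site being crawled
CUSTOM_SETTINGS = {
    # Concurrency: how many requests may be in flight at once
    "CONCURRENT_REQUESTS": 16,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    # Robustness: give up on slow responses and retry failed requests
    "DOWNLOAD_TIMEOUT": 30,
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 2,
    # Data storage: export scraped items to a JSON file (assumed file name)
    "FEEDS": {"quotes.json": {"format": "json", "overwrite": True}},
}

# A dictionary like this could be set as the spider's custom_settings
# attribute or passed as CrawlerProcess(settings=CUSTOM_SETTINGS).
feed_formats = [options["format"] for options in CUSTOM_SETTINGS["FEEDS"].values()]
print(feed_formats)
```

With a feed configured this way, the scraped items would end up in a file in addition to appearing in the log.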

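Conclusion

This article discussed the differences between Beautiful Soup and Scrapy in Python. Both are used for web scraping, but they differ in functionality. Use Beautiful Soup when you need to scrape data from a single page, and use Scrapy when you need to scrape large amounts of data from many pages.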

Updated on: July 6, 2023
