Scrapy - Crawling
Description
To execute your spider, run the following command inside the first_scrapy directory:
scrapy crawl first
Here, first is the name of the spider specified when it was created.
Once the spider has crawled, you will see output like the following:
2016-08-09 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2016-08-09 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Spider opened
2016-08-09 18:13:08-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] INFO: Closing spider (finished)
As you can see in the output, each URL has a log line ending in (referer: None), which indicates that these URLs are start URLs and therefore have no referrer. Next, you should see two new files named Books.html and Resources.html created in your first_scrapy directory.
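Those two file names come from the crawled URLs themselves. A minimal sketch, assuming the spider's parse() callback names each saved page after the last directory segment of the URL (the helper name filename_for is hypothetical, not part of Scrapy):

```python
# Hedged sketch: derive "Books.html" / "Resources.html" from the crawled
# URLs, as a parse() callback might do before writing response.body to disk.
def filename_for(url):
    # The crawled URLs end with a trailing slash, so strip it and take
    # the final path segment ("Books", "Resources") as the base name.
    return url.rstrip("/").split("/")[-1] + ".html"

print(filename_for("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"))
print(filename_for("http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"))
```

Running this prints Books.html and Resources.html, matching the files created in the first_scrapy directory.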