Python Crawler Notes (10) - Reading the Official Scrapy Documentation - Scrapy shell


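The Scrapy shell is used to test XPath and CSS expressions and to see what data they extract. Scrapy can use IPython, bpython, or the standard Python shell. The choice is made by setting the SCRAPY_PYTHON_SHELL environment variable, or by defining it in scrapy.cfg: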

[settings]
shell = bpython
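The environment variable route can be sketched in Python as follows. This is only an illustration: `build_shell_invocation` is a hypothetical helper, not part of Scrapy, and it merely prepares the command line and environment that would be handed to a process launcher.

```python
import os


def build_shell_invocation(url, python_shell="ipython"):
    """Return (command, environment) for launching `scrapy shell`.

    Hypothetical helper for illustration; it does not start anything.
    """
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ)
    # SCRAPY_PYTHON_SHELL selects which interactive shell Scrapy embeds.
    env["SCRAPY_PYTHON_SHELL"] = python_shell
    return ["scrapy", "shell", url], env


command, env = build_shell_invocation("http://example.org", python_shell="bpython")
print(command, env["SCRAPY_PYTHON_SHELL"])
# To actually start the shell, pass both values to a launcher, for example:
#   subprocess.run(command, env=env)
```

Setting the variable per invocation like this leaves scrapy.cfg unchanged, which may be convenient when different machines have different shells installed.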

The command to start the Scrapy shell:

scrapy shell <url>

Here url is the URL of the page you want to crawl. The shell also works with local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

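A relative path must be written explicitly with ./ (or ../ where relevant). Running scrapy shell index.html therefore does not work as one might expect: the shell favors HTTP URLs over file paths, so index.html is treated as a domain name and a DNS lookup is attempted.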
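One way to sidestep the ambiguity is to turn the local path into an absolute file URI before passing it to the shell. The snippet below is a minimal sketch using only the standard library; `to_file_uri` is a hypothetical helper name, and index.html is just the example file name from the paragraph above.

```python
from pathlib import Path


def to_file_uri(path_str: str) -> str:
    """Return an absolute file:// URI for a local path (hypothetical helper)."""
    # resolve() makes the path absolute; as_uri() only accepts absolute paths.
    return Path(path_str).resolve().as_uri()


uri = to_file_uri("index.html")
# The resulting argument cannot be mistaken for a domain name.
print(f"scrapy shell {uri}")
```

The printed command uses the File URI form, so it does not depend on the ./ prefix being present.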

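shelp(): prints help listing the available objects and shortcuts

fetch(url[, redirect=True]): sends a request to url, fetches the response, and updates all related objects (for example the response object). To avoid following redirects, pass redirect=False

fetch(request): fetches a response for the given request and updates all related objects

view(response): opens the response in the local web browser. The response is saved to a temporary file for this purpose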

Use Ctrl-D (Ctrl-Z on Windows) to exit the current shell

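crawler: the current Crawler object. Its API: https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.Crawler

spider: the current Spider object. The Spider class defines how a site is crawled, including the crawling actions and how structured data is extracted from pages

request: the Request object of the last fetched page. It can be modified with the replace() method

response: the Response object of the last fetched page

settings: the current Scrapy settings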

To inspect a response from inside our own spider with the shell, insert a call to inspect_response() in the code:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

This works like a breakpoint in the code. From that point on, the session behaves the same as a regular scrapy shell session:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
…

>>> response.url
'http://example.org'

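At this point the Scrapy engine is blocked by the shell, so the fetch shortcut cannot be used. Once the shell is closed, the spider resumes from where it stopped.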


Posted in: Python crawlers

2022-10-25
