Python Crawler, Part 11


Scrapy: practical lessons learned

I. Downloading files with FilesPipeline

1. Settings
Both ITEM_PIPELINES and FILES_STORE are required; neither can be left out. I had written FILES_STORE as FILE_STORE, which left the pipeline disabled:

20200213 15:48:27 [scrapy.middleware] INFO: Enabled item pipelines: []
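For reference, a minimal settings.py sketch with both required keys (the pipeline entry shown is Scrapy's stock FilesPipeline, a custom subclass would be registered the same way, and the store path is just an example):

```python
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'F:\downloads'  # note the S: FILES_STORE, not FILE_STORE
```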

2. Download results
With urlretrieve, the file is visible on disk while it is still downloading, and the target filename has to be given as a parameter.
With FilesPipeline, the file only appears once the download completes, and the filename is generated automatically; if the URL carries a file extension it is reused, otherwise you have to rename the file yourself.
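For comparison, a tiny urlretrieve sketch (URL and filename are placeholders): the caller names the output file itself, and the partially written file is already visible on disk while the transfer runs.

```python
from urllib.request import urlretrieve

# the second argument is the local filename we pick ourselves
urlretrieve('http://www.example.com/video.mp4', 'video.mp4')
```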

3. Download settings
FilesPipeline downloads can be constrained by: the download timeout (DOWNLOAD_TIMEOUT), the number of retries (the Retry middleware), redirect handling (the Redirect middleware), the maximum number of simultaneous downloads per domain, and the maximum per IP.

When a file download exceeds the timeout, it is aborted.
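A settings.py sketch showing where those limits live, using Scrapy's standard setting names (the values are only examples, not the ones used in this run):

```python
# settings.py
DOWNLOAD_TIMEOUT = 600              # abort a download after 600 seconds
RETRY_ENABLED = True
RETRY_TIMES = 2                     # retries are handled by RetryMiddleware
REDIRECT_ENABLED = True             # 302s are handled by RedirectMiddleware
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # max simultaneous downloads per domain
CONCURRENT_REQUESTS_PER_IP = 0      # if non-zero, this overrides the per-domain limit
```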

4. Download process

5. Download errors

{'downloader/exception_count': 29, 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 1, 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 20, 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8, 'downloader/request_bytes': 89242, 'downloader/request_count': 202, 'downloader/request_method_count/GET': 202, 'downloader/response_bytes': 3689994587, 'downloader/response_count': 260, 'downloader/response_status_count/200': 226, 'downloader/response_status_count/302': 30, 'downloader/response_status_count/404': 4, 'elapsed_time_seconds': 6032.416843, 'file_count': 42, 'file_status_count/downloaded': 41, 'file_status_count/uptodate': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2020, 2, 13, 19, 25, 10, 570224), 'item_scraped_count': 75, 'log_count/DEBUG': 1469, 'log_count/ERROR': 19, 'log_count/INFO': 110, 'log_count/WARNING': 33, 'request_depth_max': 2, 'response_received_count': 230, 'retry/max_reached': 29, 'scheduler/dequeued': 185, 'scheduler/dequeued/memory': 185, 'scheduler/enqueued': 185, 'scheduler/enqueued/memory': 185, 'spider_exceptions/AttributeError': 4, 'spider_exceptions/TypeError': 15, 'start_time': datetime.datetime(2020, 2, 13, 17, 44, 38, 153381)}
20200214 03:25:10 [scrapy.core.engine] INFO: Spider closed (finished)

94 URLs in total: the spider failed to parse 15 + 4 = 19 of them and produced 75 items. Of those, 42 files were downloaded, 29 hit exceptions and 4 returned 404, which again adds up to 75. The exceptions break down into 20 download (user) timeouts, 1 TCP connect timeout, and 8 responses never received.
[Both the 404s and the other exceptions mean the file was not downloaded, and when the file is not downloaded the files field stays empty.]
[3 of them are pages Selenium could not open and simply returned None for; presumably the Scrapy downloader could not open them either and got a 404 back.]

Reasons for the parse failures:
TypeError:
In 12 cases, videoSrc is not an attribute of the video tag but of the source tag inside it. Fix: switch to //video//@src.
In 3 cases, Selenium timed out loading the page and returned None. Fix: try clicking body instead of the div (just moving focus there should be enough to trigger the JS, no real click needed).
AttributeError:
In 4 cases, the date is not hard-coded in the HTML source. Fix: fall back to the current day when it is missing, and also switch the date to the YYYYMMDD format (see the sketch below).
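As a rough sketch of what the two fixes might look like in the parse_realurl callback that appears in the tracebacks later in this post (the spider class, the date selector and the fallback behaviour are my own illustration, not the original code):

```python
import datetime
import scrapy

class VideoSpider(scrapy.Spider):
    name = 'video_fix_sketch'  # hypothetical

    def parse_realurl(self, response):
        # TypeError fix: the src sits on the <source> tag inside <video>,
        # so descend with //video//@src instead of //video/@src
        video_src = response.xpath('//video//@src').get()
        if video_src is None:
            # Selenium/JS never produced a src on this page; skip instead of crashing
            return

        # AttributeError fix: fall back to today's date when the page has no
        # hard-coded date, and normalise to YYYYMMDD
        date = response.xpath('//div[@class="date"]/text()').get()  # selector is hypothetical
        if not date:
            date = datetime.date.today().strftime('%Y%m%d')

        yield {'title': response.meta.get('title'), 'date': date, 'file_urls': [video_src]}
```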

File download failures:
Cause of the 404s:
Cause of the download timeouts:
Cause of the TCP timeouts:
Cause of the refused/never-received responses:
For all of these the only thing left to try is switching to random request headers (see the middleware sketch below).
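A sketch of the random-request-header idea as a downloader middleware (class name, module path and the User-Agent strings are placeholders; in practice the pool would hold many real browser headers):

```python
import random

# placeholder pool of User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a different User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request normally

# settings.py (hypothetical module path):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}
```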

Also still to do: the ability to change the downloaded file's name (a file_path override; see the custom FilesPipeline sketch later in this post).

20200214 01:56:08 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc1.com> (failed 1 times): User timeout caused connection failure: Getting http://www.abc.com took longer than 600.0 seconds..
20200214 01:56:08 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc1.com> referred in <None>: User timeout caused connection failure: Getting http://www.abc.com took longer than 600.0 seconds..

All 20 download timeouts were reported by [scrapy.downloadermiddlewares.retry]. Note that this downloadermiddlewares refers to Scrapy's built-in downloader middleware, not the downloader middleware I set up myself.

20200214 01:45:44 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc2.com> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
20200214 01:45:44 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc2.com> referred in <None>: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

20200214 01:49:36 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc3.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc3.com> referred in <None>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]

Spider code errors

20200214 01:46:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://shell.sososhell.com/embed/2e654ed621033add1a2f48e42a62f633> (referer: http://cl.3211i.xyz/htm_data/2002/22/3812228.html)
  File "F:…", line 49, in parse_realurl
    ...
    print('B' * 50+' '+videoSrc)
TypeError: must be str, not NoneType

20200214 02:42:20 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "https://…"}
20200214 02:42:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 408 1155
20200214 02:42:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Timed out opening: https://...
20200214 02:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://...> 1 (referer: http://...)
20200214 02:42:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://ppse026.com/play/video.php?id=15> (referer: http://cl.3211i.xyz/htm_data/2002/22/3812218.html)
Traceback (most recent call last):
  File "F:…", line 49, in parse_realurl
    print('B' * 50+' '+videoSrc)
TypeError: must be str, not NoneType

# Selenium opens the page and clicks; the load then times out and None is returned, so the Scrapy downloader opens the page directly (the page itself loads fine and returns 200, it is only the video that times out after the click). Since the videoSrc attribute only exists after the JS has run, it is currently NoneType.
# Another observation: even without clicking, the videoSrc attribute is there once the page has loaded. Try switching to clicking body?

There are also some pages like http://333.thumbfox.com/embed/7986/
where the src attribute is not on the video tag but on the source tag inside it; switch to //video//@src.
There are 12 such cases.
12 + 3 = 15 TypeErrors in total.

On the vast majority of pages the date is hard-coded in the source; on 4 of them it is generated by JS, hence the AttributeErrors.

Normal download flow:

20200214 02:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# step omitted from the log: parse1 runs, extracts abc_3.com from the abc_2.com response and yields Request(url=abc_3.com), which is handled by the Selenium downloader middleware
20200214 02:53:25 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "abc_3.com"}
20200214 02:53:26 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 02:53:26 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 02:53:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: abc_2.com.html)
# step omitted from the log: parse2 runs, extracts abc_4.com from the abc_3.com response and sets it as the item's file_urls, which is handled by FilesPipeline
20200214 03:02:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://abc_5.com> from <GET https://abc_4.com>
20200214 03:18:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_5.com> (referer: None)
20200214 03:18:04 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://abc_4.com> referred in <None>
20200214 03:18:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://abc_3.com> {'date': '2020-02-14', 'file_urls': ['https://abc_4.com'], 'files': [{'checksum': 'd0f5f18d89666750d2c0b85980791930', 'path': 'full/0f6e48154b4b3e886c7d2d96f254c1de0fbc4084', 'url': 'https://abc_4.com'}], 'title': '…'}

Note:
The [scrapy.downloadermiddlewares.redirect] line does not always appear; in other words, redirects are actually handled by Scrapy's own built-in downloadermiddlewares.

From the flow above:
1. abc_1.com is the list page, abc_2.com the detail page, abc_3.com the external video page linked from the detail page, abc_4.com the videoSrc found on the video page (the real URL is actually http://abc_4.mp4), and abc_5.com the address the videoSrc redirects to.
2. abc_1.com and abc_2.com are crawled by Scrapy normally, abc_3.com is fetched via Selenium, and abc_4.mp4 is downloaded by FilesPipeline.
3. [scrapy.core.engine] DEBUG: Crawled (200) means the request succeeded and a response came back.
[scrapy.core.scraper] DEBUG: Scraped means parsing of the response has finished (? this is my own guess).
4. scrapy.core.engine reporting the response complete (Crawled 200 https://abc_5.com), scrapy.pipelines.files reporting the download complete (Downloaded file from https://abc_4.com), and scrapy.core.scraper reporting the item scraped (Scraped from) happen essentially at the same time.
5. Once the file download completes, the corresponding item is printed, and by then the files field has been filled in automatically.

Flow when a download fails:
The 404 flow:

20200214 01:45:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# step omitted from the log: parse1 runs, extracts abc_3.com from the abc_2.com response and yields Request(url=abc_3.com), which is handled by the Selenium downloader middleware
20200214 01:45:53 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "abc_3.com"}
20200214 01:45:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 01:45:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 01:45:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: abc_2.com.html)
# step omitted from the log: parse2 runs, extracts abc_4.com from the abc_3.com response and sets it as the item's file_urls, which is handled by FilesPipeline
20200214 01:48:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://162.209.170.99/20200213/62.mp4> (referer: None)
20200214 01:48:00 [scrapy.pipelines.files] WARNING: File (code: 404): Error downloading file from <GET http://162.209.170.99/20200213/62.mp4> referred in <None>
20200214 01:48:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ppse026.com/play/video.php?id=62> {'date': '2020-02-14', 'file_urls': ['https://abc_4.com'], 'files': [], 'title': '…'}

The request-failure flow:

20200214 01:48:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# step omitted from the log: parse1 runs, extracts abc_3.com from the abc_2.com response and yields Request(url=abc_3.com), which is handled by the Selenium downloader middleware
20200214 01:48:14 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "https://abc_3.com"}
20200214 01:48:17 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 01:48:17 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 01:48:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: http://abc_2.com)
# step omitted from the log: parse2 runs, extracts abc_4.com from the abc_3.com response and sets it as the item's file_urls, which is handled by FilesPipeline
20200214 01:48:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://abc_5.com> from <GET https://abc_4.com>
20200214 01:49:36 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://abc_5.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET https://abc_4.com> referred in <None>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://abc_3.com> {'date': '2020-02-13', 'file_urls': ['https://abc_4.com'], 'files': [], 'title': '…'}

Custom FilesPipeline
1. After the item is handed to Scrapy, Scrapy inspects its file_urls field, puts those URLs into the scheduler and issues the requests;
2. Note that the item's other fields are not carried along with those requests (which is why the binding trick described below is needed).

ImagesPipeline:
1. ImagesPipeline uses its get_media_requests(self, item, info) method to read image_urls from the item; from image_urls (a list of URLs) it builds a list of Requests and hands them back to Scrapy.
2. get_media_requests is called before the requests go out, while file_path is called once a request has completed and the image is about to be written to disk. So if file_path needs information from the item, you must first bind that information to the Request object and then read it back inside file_path (see the sketch after this list).
[From my own experiments, binding the data in the FilesPipeline's process_item also works, which suggests process_item is called before file_path.]
3. file_path must return either an absolute path or a path relative to FILES_STORE. My suggestion: put an absolute path into FILES_STORE (it can be derived from __file__), read FILES_STORE back inside file_path and join the two, so that file_path effectively returns an absolute path.
4. Open questions: if file_path raises an error, does it fail silently and simply cause the download to fail? And do the requests issued by the pipeline pass through the downloader middlewares?
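Pulling the pieces above together, a sketch of a custom FilesPipeline, assuming items are plain dicts with title and file_urls as in the logs earlier (ImagesPipeline works the same way with image_urls): get_media_requests binds the title to the Request via meta, and file_path uses it to build the file name, which also covers the "rename the downloaded file" item from the to-do list above.

```python
import os
import scrapy
from scrapy.pipelines.files import FilesPipeline

class RenamingFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # bind item data to each Request so that file_path can see it later
        for url in item.get('file_urls', []):
            yield scrapy.Request(url, meta={'title': item.get('title')})

    def file_path(self, request, response=None, info=None, *, item=None):
        # read back the data bound in get_media_requests
        title = request.meta.get('title') or 'unnamed'
        # keep the extension from the URL when there is one, otherwise guess
        ext = os.path.splitext(request.url)[1] or '.mp4'
        # the returned path is interpreted relative to FILES_STORE
        return f'full/{title}{ext}'
```

It would be registered in ITEM_PIPELINES in place of the stock FilesPipeline.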

5. Downloader middleware: what is process_request() supposed to return?
When modifying the request headers, I never wrote a return value at all!!!! (For the record, process_request() may return None, a Response, a Request, or raise IgnoreRequest; falling off the end of the function returns None, which simply lets Scrapy keep processing the request, so the header changes still take effect.)

One scenario:
When yielding a dict, how do you handle a field that is a list of variable length whose elements each come from a different URL?
In Scrapy, once you yield a Request, the subsequent response and parse results are handled automatically and sent on to the pipeline; there is no way to get hold of them as intermediate results. It feels like you cannot assemble the complete object in Python first and then store it in the database; you can only store it first and then keep querying and updating it, which seems bad for efficiency.
Hmm, looking at the problem above from another angle: each element of the list can be treated as its own entity, with its own table/collection, and then the query-before-write problem disappears (see the sketch below).
Scrapy's processing really does feel like a pipeline: it can only move forward, never backtrack.
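A sketch of that "separate entity" idea, with all names and selectors hypothetical: the parent record is yielded without the list field, and each would-be list element is later yielded as its own record linked back by parent_id, so nothing ever has to be read back and updated.

```python
import scrapy

class SplitListSpider(scrapy.Spider):
    name = 'split_list'                       # hypothetical
    start_urls = ['http://www.example.com/']  # hypothetical

    def parse(self, response):
        parent_id = response.url
        # the parent record goes out immediately, without the variable-length list
        yield {'_type': 'post', 'id': parent_id,
               'title': response.xpath('//title/text()').get()}
        # one follow-up request per would-be list element
        for url in response.xpath('//a/@href').getall():
            yield scrapy.Request(url, callback=self.parse_element,
                                 meta={'parent_id': parent_id})

    def parse_element(self, response):
        # each element becomes its own record in a separate table/collection
        yield {'_type': 'element', 'parent_id': response.meta['parent_id'],
               'value': response.xpath('//video//@src').get()}
```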

