Multi-level page crawling with Scrapy (beginner exercise)

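Exercise: Quotes to Scrape (a quotes website). Level: beginner

Scrape the information for each quote: the quote text, author, tags, the author's date of birth, the author's place of birth, and the author's short description.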

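Approach:

1. Start URL (the website address): http://quotes.toscrape.com/

2. Fetch the response for the start URL and pass it to the parse1 function, which parses the first-level page (it is named parse in the spider code).

3. Extract the URL of each quote's second-level page and enqueue a request for it, with the parse2 function as the callback; parse2 parses the second-level page (it is named parse_author in the spider code).

4. parse2 parses the response of each second-level URL and produces the final data.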

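Easily overlooked points:

1. The URL of a quote's second-level page is built from the author's name, so many second-level URLs repeat. These requests must skip dedup (dont_filter=True); otherwise only the first quote of each author reaches the second-level callback.

2. Study the structure of the page carefully before writing a selector. During this crawl my selector for the "next page" element was imprecise, so many records were always missing. At first I blamed the dedup mechanism, but turning dedup off sent the spider into an infinite loop.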

next_url = response.xpath('//ul[@class="pager"]//a/@href').extract()[0]
if next_url is not None:
    yield scrapy.Request(url=self.base_url + next_url, callback=self.parse)

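Written this way, next_url picks up the previous-page URL on every page that has a "previous" link, and the if condition is always true, because extract()[0] returns a string and never None. I had overlooked the previous-page element when analysing the element tree. The correct next_url code is in the spider further down.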
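The difference between the two selectors can be reproduced without running a crawl. The following is a minimal sketch using the standard library's xml.etree.ElementTree on a simplified, hand-written copy of the pager markup; PAGER_HTML is an assumption for illustration, and the real page carries extra markup. ElementTree supports only a subset of XPath, so the href is read with .get() instead of /@href.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the pager on a middle page (assumed markup).
PAGER_HTML = """
<nav>
  <ul class="pager">
    <li class="previous"><a href="/page/1/">Previous</a></li>
    <li class="next"><a href="/page/3/">Next</a></li>
  </ul>
</nav>
""".strip()

root = ET.fromstring(PAGER_HTML)

# Loose selector: every link inside the pager, in document order.
loose = [a.get("href") for a in root.findall(".//ul[@class='pager']//a")]

# Strict selector: only the link inside <li class="next">.
strict = [a.get("href") for a in root.findall(".//ul[@class='pager']/li[@class='next']/a")]

print(loose[0])   # the "previous" link comes first
print(strict[0])  # the actual next page
```

Taking element [0] of the loose result returns the previous page whenever a "previous" link exists, which is what kept sending the spider backwards.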

Preparation: create a new project and a new spider

Define the target: items.py

import scrapy


class QuotesToscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Quote text
    quote = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Tags
    tags = scrapy.Field()

    # Date of birth
    born_date = scrapy.Field()
    # Place of birth
    born_location = scrapy.Field()
    # Description
    description = scrapy.Field()

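Define the spider: quotes.py

[Dedup mechanism: the Request parameter dont_filter defaults to False, which means dedup is on. Every time a Request is yielded, the scheduler compares it with the requests it has already seen. If the same URL has been seen, the request is not enqueued by default; otherwise it is enqueued. Every request goes through this check before entering the queue. To skip dedup for a request, set dont_filter=True.]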
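The dedup rule described above can be pictured with a toy model. This is only a sketch of the idea: ToyScheduler and its enqueue method are hypothetical names, and Scrapy's real scheduler compares request fingerprints rather than raw URL strings.

```python
from collections import deque


class ToyScheduler:
    """Toy model of URL dedup; not Scrapy's real scheduler."""

    def __init__(self):
        self.seen = set()     # URLs that passed the dedup check
        self.queue = deque()  # URLs waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Return True if the URL was queued, False if it was dropped."""
        if not dont_filter:
            if url in self.seen:
                return False  # duplicate: drop it
            self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
author_url = "http://quotes.toscrape.com/author/Albert-Einstein"  # example URL
results = [
    scheduler.enqueue(author_url),                    # first time: queued
    scheduler.enqueue(author_url),                    # duplicate: dropped
    scheduler.enqueue(author_url, dont_filter=True),  # dedup skipped: queued
]
print(results)               # [True, False, True]
print(len(scheduler.queue))  # 2
```

In this exercise the second case is the one that silently loses quotes: two quotes by the same author produce the same second-level URL, and the second request never reaches the callback unless dont_filter=True is set.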

# -*- coding: utf-8 -*-
import scrapy
from quotes_toscrape.items import QuotesToscrapeItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    # start_urls = ['http://quotes.toscrape.com/']

    page = 1
    first_url = 'http://quotes.toscrape.com/page/{}/'
    start_urls = [first_url.format(page)]

    def parse(self, response):
        node_list = response.xpath('//div[@class="quote"]')
        for node in node_list:
            # [1:-1] drops the quotation marks wrapped around the quote text
            quote = node.xpath('.//span[@class="text"]/text()').extract()[0][1:-1]
            author = node.xpath('.//small/text()').extract()[0]
            # Query the current node (not node_list) and keep every tag of this quote
            tags = node.xpath('.//div[@class="tags"]//a/text()').extract()
            # urljoin resolves the relative author link against the current page URL
            href = response.urljoin(node.xpath('.//small/following-sibling::a/@href').extract()[0])
            # dont_filter=True because several quotes share the same author page
            yield scrapy.Request(url=href, meta={'quote': quote, 'author': author, 'tags': tags},
                                 callback=self.parse_author, dont_filter=True)

        # extract_first() returns None on the last page instead of raising IndexError
        next_url = response.xpath('//ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
        # if self.page<10:
        #     self.page += 1
        #     yield scrapy.Request(url=self.first_url.format(self.page), callback=self.parse)

    def parse_author(self, response):
        item = QuotesToscrapeItem()
        # Combine the fields passed from the first-level page with the author details
        item['quote'] = response.meta['quote']
        item['author'] = response.meta['author']
        item['tags'] = response.meta['tags']
        item['born_date'] = response.xpath('//span[@class="author-born-date"]/text()').extract()[0]
        # [3:] drops the leading "in " from the location text
        item['born_location'] = response.xpath('//span[@class="author-born-location"]/text()').extract()[0][3:]
        # Strip leading and trailing whitespace
        item['description'] = response.xpath('//div[@class="author-description"]/text()').extract()[0].strip()
        yield item
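Two small details of the spider are easy to check in isolation: the slices that clean up the extracted text, and how a relative link is turned into an absolute URL. The sketch below uses only the standard library; the sample strings are made up to have the same shape as the site's text (a quote wrapped in curly quotation marks, a location prefixed with "in "), and they are assumptions rather than scraped output.

```python
from urllib.parse import urljoin

# Sample values shaped like the site's text (assumed, not scraped).
raw_quote = "\u201cA sample quote.\u201d"
raw_location = "in Ulm, Germany"

quote = raw_quote[1:-1]      # drop the opening and closing quotation marks
location = raw_location[3:]  # drop the leading "in "

base_url = "http://quotes.toscrape.com/"
href = "/author/Albert-Einstein"  # relative link as it appears in an href

concatenated = base_url + href    # plain concatenation doubles the slash
joined = urljoin(base_url, href)  # urljoin resolves the path cleanly

print(quote)         # A sample quote.
print(location)      # Ulm, Germany
print(concatenated)  # http://quotes.toscrape.com//author/Albert-Einstein
print(joined)        # http://quotes.toscrape.com/author/Albert-Einstein
```

Inside a callback, response.urljoin(href) does the same resolution against the URL of the page being parsed, so no base URL needs to be stored on the spider.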

Define the pipeline: pipelines.py

import json


class QuotesToscrapePipeline(object):
    def __init__(self):
        # Binary mode, so every write must be encoded bytes
        self.file = open('quotes.json', 'wb')

    def process_item(self, item, spider):
        data = json.dumps(dict(item), ensure_ascii=False, indent=4) + ','
        # Encode to UTF-8 before writing
        self.file.write(data.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()
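One thing to be aware of: the pipeline above writes objects separated by commas with no enclosing brackets, so the file as a whole cannot be loaded with a single json.load call. If that matters, one option is to write one JSON object per line (the JSON Lines convention). The class below is a hypothetical variant, not part of the original project; JsonLinesPipeline and the quotes.jl file name are assumptions. It relies only on the open_spider, process_item and close_spider hooks that Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical variant: writes one JSON object per line."""

    def __init__(self, path="quotes.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Text mode with an explicit encoding, so no manual encode() is needed
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy items and plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

Each line of the output can then be read back independently with json.loads, which also makes it easy to process a partially written file if the crawl is interrupted.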
