功能点:如何爬取列表页,并根据列表页获取详情页信息?
爬取网站:东莞阳光政务网(wz.sun0769.com)
完整代码:https://files.cnblogs.com/files/bookwed/yangguang.zip
主要代码:
yg.py
import scrapy from yangguang.items import YangguangItem
class YgSpider(scrapy.Spider):
    """Crawl the complaint list pages of the Dongguan "Sunshine" government
    affairs site (wz.sun0769.com), follow each row's link to its detail page,
    and yield one YangguangItem per complaint.
    """
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/report']

    def parse(self, response):
        """Parse one list page: extract per-row fields, request each detail
        page, then follow pagination."""
        # Each <tr> in the second table inside the grey frame is one complaint row.
        tr_list = response.xpath("//div[@class='greyframe']/table[2]//tr")
        for tr in tr_list:
            item = YangguangItem()
            item["title"] = tr.xpath("./td[2]/a[2]/text()").extract_first()
            item["href"] = tr.xpath("./td[2]/a[2]/@href").extract_first()
            item["status"] = tr.xpath("./td[3]/span/text()").extract_first()
            item["publish_time"] = tr.xpath("./td[last()]/text()").extract_first()
            # Header/separator rows yield no link (extract_first() -> None);
            # only follow rows with a real href.
            # Fix: isinstance() instead of the non-idiomatic type(...) == str.
            if isinstance(item["href"], str):
                # Request the detail page, carrying the partly-filled item
                # along in meta so parse_detail can finish it.
                yield scrapy.Request(
                    item["href"],
                    callback=self.parse_detail,
                    meta={"item": item},
                )
        # Pagination: the '>' anchor points at the next list page, absent on the last one.
        next_url = response.xpath("//a[text()='>']/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail(self, response):
        """Parse a detail page: fill in body text and image URLs, then emit the item."""
        item = response.meta["item"]
        item["content"] = response.xpath(
            "//div[@class='wzy1']/table[2]//tr[1]/td[@class='txt16_3']/text()"
        ).extract()
        item["content_image"] = response.xpath(
            "//div[@class='wzy1']/table[2]//tr[1]/td[@class='txt16_3']//img/@src"
        ).extract()
        # Image srcs are site-relative; prefix the host to get absolute URLs.
        item["content_image"] = ["http://wz.sun0769.com" + i for i in item["content_image"]]
        yield item
pipelines.py
# NOTE(review): this module requires `import json` and `import re` at the top
# of pipelines.py — they are used below but not shown in the excerpt.
class YangguangPipeline(object):
    """Item pipeline: clean the scraped content and append each item as a
    JSON object (one per line, comma-terminated) to yangguang.json."""

    def __init__(self):
        # Keep one file handle open for the whole crawl; closed in close_spider.
        self.f = open('yangguang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        """Clean item["content"] and write the item as one JSON line."""
        item["content"] = self.process_content(item["content"])
        # Fix: ',\n' — the blog stripped the backslash (',n'), which would
        # have fused all records onto one line with a stray 'n' between them.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item

    def process_content(self, content):
        """Strip non-breaking spaces and whitespace from each fragment and
        drop fragments that become empty."""
        # Fix: r"\xa0|\s" — the blog stripped the backslashes (r"xa0|s"),
        # which would have deleted every literal 's' and 'xa0' substring
        # instead of NBSP and whitespace characters.
        content = [re.sub(r"\xa0|\s", "", i) for i in content]
        return [i for i in content if len(i) > 0]

    def close_spider(self, spider):
        # Fix resource leak: Scrapy calls this when the spider finishes,
        # flushing and releasing the output file.
        self.f.close()
转载于:https://www.cnblogs.com/bookwed/p/10617789.html
神龙|纯净稳定代理IP免费测试>>>>>>>>天启|企业级代理IP免费测试>>>>>>>>IPIPGO|全球住宅代理IP免费测试