爬虫实战案例

899次阅读
没有评论

前言:

实战携程案例,也是一个学习总结

目标:

全国景区名

1.数据分析

目标网站pc段实在不好分析,于是想到了手机端

爬虫实战案例

 目标网站异步加载,目标数据存放在名为getAttractioList?*****下

爬虫实战案例

 分析一遍后发现一个头疼的事,下一页的token参数在上一页的响应数据里,这里后面代码要注意一下

因为目标要求是全国景区数据,这里只是上海,通过分析负载数据,得知count是每页的数据条,districtId是城市Id,上海就是2,北京是1

爬虫实战案例

 通过对响应数据的分析,这里用xpath解析最方便,在postman分析一段时间后也是将初步数据拿下来了

代码块

self.jq_payload ={ “index”: 1, “count”: 20, “sortType”: 1, “isShowAggregation”: True, “districtId”: 2, “scene”: “DISTRICT”, “pageId”: “214062”, “traceId”: “63675663-049d-d442-3189-93e543806482”, “extension”: [ { “name”: “osVersion”, “value”: “13.2.3” }, { “name”: “deviceType”, “value”: “ios” } ], “filter”: { “filterItems”: [] }, “crnVersion”: “2020-09-01 22:00:45”, “isInitialState”: True, “head”: { “cid”: “09031138417081694673”, “ctok”: “”, “cver”: “1.0”, “lang”: “01”, “sid”: “8888”, “syscode”: “09”, “xsid”: “”, “extension”: [] } } self.jq_headers = { ‘Accept’: ‘*/*’, ‘Accept-Language’: ‘zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,pl;q=0.5’, ‘Connection’: ‘keep-alive’, ‘Cookie’: ‘GUID=09031138417081694673; nfes_isSupportWebP=1; _bfaStatusPVSend=1; _RGUID=e8d913f6-e1e1-4b35-a53d-45bb72e4f277; _RDG=288799a623152421103541a50c6b033c50; _RSG=riKuXR_4PZEt9FvojRHzKA; MKT_CKID=1658192357207.a68d9.kssa; appFloatCnt=1; _ga=GA1.2.1863916334.1658192399; ibu_h5_lang=en; ibu_h5_local=en-us; _RF1=36.63.50.106; ibulanguage=CN; ibulocale=zh_cn; cookiePricesDisplayed=CNY; MKT_CKID_LMT=1659943689861; _gcl_au=1.1.1499402294.1659943691; Session=smartlinkcode=U130727&smartlinklanguage=zh&SmartLinkKeyWord=&SmartLinkQuary=&SmartLinkHost=; _jzqco=%7C%7C%7C%7C1659943690175%7C1.107574341.1658192357204.1659943689865.1659943699647.1659943689865.1659943699647.0.0.0.6.6; __zpspc=9.3.1659943689.1659943699.2%233%7Ccn.bing.com%7C%7C%7C%7C%23; _bfs=1.2; _bfi=p1%3D102001%26p2%3D102001%26v1%3D37%26v2%3D36; _bfaStatus=success; MKT_Pagesource=H5; _gid=GA1.2.199342908.1659943769; _gat=1; mktDpLinkSource=ullink; _pd=%7B%22r%22%3A14%2C%22d%22%3A293%2C%22_d%22%3A279%2C%22p%22%3A293%2C%22_p%22%3A0%2C%22o%22%3A295%2C%22_o%22%3A2%2C%22s%22%3A296%2C%22_s%22%3A1%7D; _bfa=1.1652697082838.1x1gqg.1.1659856911096.1659943805593.4.41.214062; _ubtstatus=%7B%22vid%22%3A%221652697082838.1x1gqg%22%2C%22sid%22%3A4%2C%22pvid%22%3A41%2C%22pid%22%3A214062%7D; Union=OUID=&AllianceID=4902&SID=353701&SourceID=&AppID=&OpenID=&exmktID=&createtime=1659943806&Expires=1660548605624; MKT_OrderClick=ASID=4902353701&AID=4902&CSID=353701&OUID=&CT=1659943805625&CURL=https%3A%2F%2Fm.ctrip.com%2Fwebapp%2Fyou%2Fgspoi%2Fsight%2F2.html%3Fseo%3D0%26allianceid%3D4902%26sid%3D353701%26isHideNavBar%3DYES%26from%3Dhttps%253A%252F%252Fm.ctrip.com%252Fwebapp%252Fyou%252Fgsdestination%252Fplace%252F2.html%253Fseo%253D0%2526ishideheader%253Dtrue%2526secondwakeup%253Dtrue%2526dpclickjump%253Dtrue%2526allianceid%253D4902%2526sid%253D353701%2526sourceid%253D55551831%2526from%253Dhttps%25253A%25252F%25252Fm.ctrip.com%25252Fhtml5%25252F&VAL={“h5_vid”:”1652697082838.1x1gqg”}’, ‘Origin’: ‘https://m.ctrip.com’, ‘Referer’: ‘https://m.ctrip.com/webapp/you/gspoi/sight/2.html?seo=0&allianceid=4902&sid=353701&isHideNavBar=YES&from=https%3A%2F%2Fm.ctrip.com%2Fwebapp%2Fyou%2Fgsdestination%2Fplace%2F2.html%3Fseo%3D0%26ishideheader%3Dtrue%26secondwakeup%3Dtrue%26dpclickjump%3Dtrue%26allianceid%3D4902%26sid%3D353701%26sourceid%3D55551831%26from%3Dhttps%253A%252F%252Fm.ctrip.com%252Fhtml5%252F’, ‘Sec-Fetch-Dest’: ’empty’, ‘Sec-Fetch-Mode’: ‘cors’, ‘Sec-Fetch-Site’: ‘same-origin’, ‘User-Agent’: ‘Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1 Edg/104.0.5112.81’, ‘content-type’: ‘application/json’, ‘cookieOrigin’: ‘https://m.ctrip.com’ } def get_jq_data(self): #json数据用来传递网页本身数据类型为json的 response=requests.post(self.jq_url,headers=self.jq_headers,json=self.jq_payload) print(response.text)

爬虫实战案例

 这只是单个城市的单页数据,要想获得全国城市的信息就要获得全国城市的代号Id,这里我去到酒店页面去找了发现代号基本相同

最后通过拿前一页token,构造翻页参数基本完成,主要代码如下

def parse_jq_data(self): cname, cityId = self.parse_city_data() # 调用该方法获得所有城市名和ID for cnames, cityIds in zip(cname, cityId): print(“城市名字:”, cnames) print(“城市ID:”, cityIds) self.jq_payload[‘districtId’] = cityIds # 切换每个城市ID one_data = self.get_jq_data() # 调用请求景点的方法,像不同城市景点发起请求 # 构造翻页参数 totalCount = jsonpath(one_data, ‘$..totalCount’)[0] # print(totalCount) # 向上取整计算出总页数 page = math.ceil(totalCount / 20) # 总页数 # print(page) for i in range(1, page+1): print(‘当前城市名为{},城市ID为{},此时正在下载{}页’.format(cnames, cityIds, i)) self.jq_payload[‘index’] = i # 修改翻页参数 # 像每一页参数发起请求 page_json = self.get_jq_data() token = jsonpath(page_json, ‘$..token’)[0] print(‘token值为:’, token) self.jq_payload[‘token’] = token poiName = jsonpath(page_json, ‘$..poiName’) # 景区名字 detailUrl = jsonpath(page_json, ‘$..detailUrl’) # 链接 if poiName == False: poiName = [‘无’] if detailUrl == False: detailUrl = [‘无’] for p, d in zip(poiName, detailUrl): print(p) print(d) print(‘=’ * 10)

保存代码就偷懒没写了,不管这个数据很大,建议保存到数据库

神龙|纯净稳定代理IP免费测试>>>>>>>>天启|企业级代理IP免费测试>>>>>>>>IPIPGO|全球住宅代理IP免费测试

相关文章:

版权声明:Python教程2022-10-28发表,共计4621字。
新手QQ群:570568346,欢迎进群讨论 Python51学习