Selenium in Practice: Scraping *bao Data with a Desktop Window


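Double 11 has just ended and Double 12 is right around the corner. The listings on *bao show too little information at a glance to make a buying decision easy, so I came up with this design: a small desktop window that scrapes the search results for me.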


With the idea in hand, let's get straight to work while Double 12 is still a few weeks away.


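I. Preparation
Install selenium:

pip install selenium

tkinter does not need pip: it ships with the standard Python installers, and `pip install tkinter` will fail. On some Linux distributions it comes from a system package such as python3-tk.

Download the Firefox browser driver (geckodriver) and put it on your PATH.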
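Before going further, it can help to confirm that both modules are importable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name introduced here for illustration, not part of selenium or tkinter.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["selenium", "tkinter"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If anything is reported missing, install it before running the code in the later sections.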

II. Analyzing the Site
On the web version, you cannot search for products unless you are logged in.


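After logging in, search for 口红 (lipstick).

The URL turns out to look like this:

https://s.taobao.com/search?q=口红&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20211117&ie=utf8&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s=44

Looking at the URL, q= carries the search keyword and s= controls paging. The s value appears to be an item offset rather than a page number: it advances by 44 per page, which is why the scraping code later sets s to the page index times 44.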
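The URL pattern above can be sketched as a small helper. This is only an illustration under the assumptions stated in the text: that `q` and `s` are the only parameters needed and that each page holds 44 items. `build_search_url` and `ITEMS_PER_PAGE` are hypothetical names, not part of any library.

```python
from urllib.parse import urlencode

# Assumption: the offset step observed in the example URL (s=44 on page 2).
ITEMS_PER_PAGE = 44


def build_search_url(keyword, page_index=0):
    """Build a search URL for `keyword` and a zero-based page index."""
    params = {"q": keyword.strip(), "s": page_index * ITEMS_PER_PAGE}
    # urlencode percent-encodes non-ASCII keywords as UTF-8.
    return "https://s.taobao.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("lipstick", 1))
    # https://s.taobao.com/search?q=lipstick&s=44
```

Stripping the keyword matters later on, because text read from a tkinter `Text` widget ends with a newline.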

Next, decide which data to collect from the page.


Fields to collect: product price; number of buyers; product title; shop name; shop URL; shop location.

III. Writing the Code
1. Imports

import json
import time
import tkinter.messagebox
from tkinter import *

import pandas as pd
from selenium import webdriver

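2. Building the window

# Create the main window
window = Tk()
window.title('qcc_nw0.1')
# Window size; the separator must be the letter "x", not the "×" sign
window.geometry('500x200')
# Banner label
l = Label(window, text='How to really shop Taobao!!', bg='green', fg='white', font=('Arial', 12), width=30, height=2)
l.pack()
# Text box for the product to search for
E1 = Text(window, width=100, height=2)
E1.pack()

# Placeholders; the real implementations follow in the next sections
def get_cookie():
    pass

def get_data():
    pass

# Button that fetches the login cookies (the keyword is "command", not "ommand")
cookie = Button(window, text='Get cookie', font=('Arial', 10), width=15, height=1, command=get_cookie)
# Button that starts the data collection
data = Button(window, text='Get data', font=('Arial', 10), width=15, height=1, command=get_data)
cookie.pack(anchor='nw')
data.pack(anchor='nw')
window.mainloop()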


3. Skipping the login step
First, save the cookies from a session that is already logged in:

def get_cookie():
    # Open a new browser window
    driver = webdriver.Firefox()
    driver.get('https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Fbuyertrade.taobao.com%2Ftrade%2Fitemlist%2Flist_bought_items.htm%3Fspm%3D875.7931836%252FB.a2226mz.4.66144265Vdg7d5%26t%3D20110530')
    # Wait 20 seconds so you can log in by scanning the QR code with your phone
    time.sleep(20)
    dictCookies = driver.get_cookies()
    # Once logged in, save the cookies to a local file
    jsonCookies = json.dumps(dictCookies)
    with open("cookies_tao.json", "w", encoding="utf-8") as fp:
        fp.write(jsonCookies)
    # Close the login browser now that the cookies are saved
    driver.quit()
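A fixed 20-second sleep is either too long or too short depending on how fast you scan the QR code. One alternative is to poll for a "logged in" condition with a timeout. The sketch below is a generic polling helper using only the standard library; `wait_until` is a hypothetical name, and the login check shown in the comment (the URL no longer containing "login") is an assumption about how the site redirects, not something confirmed by the original code.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Call predicate() repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage inside get_cookie(), replacing time.sleep(20):
#     logged_in = wait_until(lambda: "login" not in driver.current_url, timeout=60)
#     if not logged_in:
#         return  # the user did not finish logging in
```

With this approach the cookies are saved as soon as the login completes, and nothing is written if it never does.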

Then load the saved cookies to reproduce the logged-in state.

1) First disguise the browser that Selenium drives; otherwise the site detects it as automated.

def get_data():
    options = webdriver.FirefoxOptions()
    ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    # Override the User-Agent
    options.set_preference('general.useragent.override', ua)
    # Hide the webdriver flag
    options.set_preference('dom.webdriver.enabled', False)
    # Turn off the automation extension
    options.set_preference('useAutomationExtension', False)
    # Preferences are set on the options object; recent Selenium releases
    # no longer accept the firefox_profile / firefox_options arguments
    browser = webdriver.Firefox(options=options)

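2) Load the saved cookies to skip the login

    # Cookies can only be added for the domain that is currently loaded,
    # so open the site before touching the cookie jar
    browser.get('https://www.taobao.com')
    # Remove any existing cookies
    browser.delete_all_cookies()
    with open('cookies_tao.json', encoding='utf-8') as f:
        listCookies = json.loads(f.read())
    # Add each saved cookie to the browser
    for cookie in listCookies:
        browser.add_cookie({
            'domain': '.taobao.com',  # note the leading dot before the domain
            'name': cookie['name'],
            'value': cookie['value'],
            'path': '/',
        })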
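The cookie loop above keeps only a few fields from each saved cookie and pins the domain and path. That transformation can be pulled into a pure function, which makes it easy to check without launching a browser. `to_selenium_cookies` is a hypothetical helper name, and the sample cookie is made up for illustration.

```python
import json


def to_selenium_cookies(raw_cookies, domain=".taobao.com"):
    """Reduce saved cookies to name/value and pin them to `domain` and path '/'."""
    return [
        {
            "domain": domain,  # leading dot so the cookie applies to subdomains
            "name": cookie["name"],
            "value": cookie["value"],
            "path": "/",
        }
        for cookie in raw_cookies
    ]


if __name__ == "__main__":
    # Made-up example in the shape that driver.get_cookies() returns.
    saved = json.loads(
        '[{"name": "demo", "value": "123", "domain": "login.taobao.com", "httpOnly": true}]'
    )
    print(to_selenium_cookies(saved))
```

Each resulting dict can then be passed to `browser.add_cookie(...)` as in the loop above.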

4. Parsing the pages and extracting data

    # Read the keyword from the text box; Text.get() returns a trailing newline, so strip it
    thing = E1.get('1.0', 'end').strip()

    # Search URL template; %s is replaced with the keyword
    url = "https://s.taobao.com/search?q=%s"
    browser.get(url % thing)
    # Minimize the browser window
    browser.minimize_window()
    # Read the total number of result pages.
    # 'xpath' is the locator strategy string (the value of selenium's By.XPATH);
    # the find_element_by_xpath helpers were removed in recent Selenium releases
    page_count = browser.find_element('xpath', '/html/body/div[1]/div[2]/div[3]/div[1]/div[26]/div/div/div/div[1]').text
    page_count = int(page_count.split(' ')[1])
    # Dictionary of lists that collects the results
    dic = {'real_title': [], 'price': [], 'payment_num': [], 'provide': [], 'city': [], 'shop_name': [], 'shop_url': []}
    # Loop over the result pages
    for i in range(page_count):
        page = i * 44
        browser.get(url % thing + '&s=%d' % page)
        div_list = browser.find_elements('xpath', '//div[@class="ctx-box J_MouseEneterLeave J_IconMoreNew"]')
        # Loop over the products on the page
        for divs in div_list:
            # Product title
            real_title = divs.find_element('xpath', './/div[@class="row row-2 title"]/a').text
            # Product price
            price = divs.find_element('xpath', './/div[@class="price g_price g_price-highlight"]/strong').text
            # Number of buyers
            payment_num = divs.find_element('xpath', './/div[@class="deal-cnt"]').text
            # Shop location
            location = divs.find_element('xpath', './/div[@class="row row-3 g-clearfix"]/div[@class="location"]').text
            # Shop name
            shop_name = divs.find_element('xpath', './/div[@class="row row-3 g-clearfix"]/div[@class="shop"]/a/span').text
            # Shop URL
            shop_url = divs.find_element('xpath', './/div[@class="row row-3 g-clearfix"]/div[@class="shop"]/a').get_attribute('href')
            # Most locations read "province city"; municipalities and
            # autonomous regions show a single name, used for both fields
            if len(location.split(' ')) > 1:
                provide = location.split(' ')[0]
                city = location.split(' ')[1]
            else:
                provide = location.split(' ')[0]
                city = location.split(' ')[0]
            # Append the collected fields; '+人付款' is the page's "+ people paid" suffix
            dic['real_title'].append(real_title)
            dic['price'].append(price)
            dic['payment_num'].append(payment_num.replace('+人付款', ''))
            dic['provide'].append(provide)
            dic['city'].append(city)
            dic['shop_name'].append(shop_name)
            dic['shop_url'].append(shop_url)
            print(real_title, price, payment_num.replace('+人付款', ''), provide, city, shop_name, shop_url)
    # Persist the collected data to a CSV file with pandas
    df = pd.DataFrame(dic)
    df.to_csv('C:/Users/admin/Desktop/' + thing + '.csv')
    browser.close()
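The string handling inside the scraping loop (page count, location, buyer count) can be isolated into small functions and checked without a browser. This is a sketch under stated assumptions: the pager label is assumed to look like "共 100 页，" (inferred from the `split(' ')[1]` call above, not verified against the live page), and `parse_page_count`, `split_location` and `clean_payment` are hypothetical helper names. Unlike the original, `split_location` splits on any whitespace rather than a single space.

```python
import re


def parse_page_count(label):
    """Return the first integer in the pager label, e.g. '共 100 页，' -> 100."""
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError(f"no page count found in {label!r}")
    return int(match.group())


def split_location(location):
    """Split 'province city' into a pair; a single name fills both fields."""
    parts = location.split()
    if not parts:
        return "", ""
    province = parts[0]
    city = parts[1] if len(parts) > 1 else parts[0]
    return province, city


def clean_payment(text):
    """Strip the '+人付款' ("+ people paid") suffix from the buyer count."""
    return text.replace("+人付款", "")


if __name__ == "__main__":
    print(parse_page_count("共 100 页，"))
    print(split_location("广东 广州"))
    print(split_location("上海"))
    print(clean_payment("1000+人付款"))
```

Keeping these as plain functions also means a change in the page's wording only needs a fix in one place.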

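At this point the scraper is basically complete.

However, I found that the data written this way did not actually get saved, so a message box is needed to bring the get_data function to an end:

def warning():
    # Show a dialog box; showinfo returns 'ok'
    result = tkinter.messagebox.showinfo(title='success!', message='Master! Data collection complete')

Call warning() at the end of the get_data function.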

All done, time to clock off!


Posted in: Python Web Scraping

2022-10-28
