A Simple Python Web Crawler


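Our team recently needed someone to give a short internal talk. I wasn't sure what to cover and eventually settled on web crawlers. I'm a beginner myself, so I started by gathering material, and this post writes up that short talk.

First, what is a crawler? My understanding: code that imitates what a person does in a browser, visiting other websites, finding the content you need, and fetching it.

So you first need the address of the site whose content you want; only then can you crawl it.

Here is a simple crawler: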

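# 1. The simplest crawler: fetch a page with no further processing
import requests  # HTTP library used to send the request

URL = "https://www.baidu.com"  # address to request
req = requests.get(URL)  # send the GET request
print(req.text)  # print the response body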


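This fetches a page as-is with no processing, which is not enough on its own: we would still have to dig the data out of the raw HTML by hand.

So I improved the code a little: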

# 2. Simple processing: extract only the part we want
import requests
from bs4 import BeautifulSoup

URL = "https://www.biqiuge.com/paihangbang/"
req = requests.get(url=URL)
req_txt = req.text
# Parse the HTML and print the text of every <div class="block bd">
bs = BeautifulSoup(req_txt, "html.parser")
for i in bs.find_all("div", class_="block bd"):
    print(i.text)


This scrapes the novel rankings from Biquge (biqiuge.com). The output is readable as-is and can be used directly.
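To make the selection step concrete, here is a minimal sketch that runs the same find_all("div", class_="block bd") call against an inline HTML snippet instead of the live site. The snippet, the extract_rankings helper, and the assumption that each block holds an h2 heading followed by links are illustrative only; the real Biquge markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a ranking page (an assumption, not the real markup)
SAMPLE_HTML = """
<div class="block bd">
  <h2>Fantasy</h2>
  <ul>
    <li><a href="/book/1/">Book One</a></li>
    <li><a href="/book/2/">Book Two</a></li>
  </ul>
</div>
<div class="block">
  <a href="/other/">Not a ranking</a>
</div>
"""


def extract_rankings(html):
    """Map each ranking heading to a list of (title, link) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    rankings = {}
    # A class_ string containing a space matches the exact class attribute value
    for block in soup.find_all("div", class_="block bd"):
        heading = block.find("h2")
        name = heading.get_text(strip=True) if heading else "unknown"
        rankings[name] = [
            (a.get_text(strip=True), a["href"])
            for a in block.find_all("a", href=True)
        ]
    return rankings


rankings = extract_rankings(SAMPLE_HTML)
for name, books in rankings.items():
    print(name, books)
```

Returning pairs rather than printing block text keeps titles and links separate, which is usually easier to reuse than one long string.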

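Think that's the end? Not quite.

As everyone knows, many sites have anti-crawling measures, so we need a little extra handling. Zhihu is one example: requesting it directly as above does not work.

So we add a request header. More sophisticated anti-crawling measures are out of scope here.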

# 3. Sending a request header
# Some sites reject a bare request and need a header, e.g. Zhihu
import requests

# Request header dictionary
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
# Pass the user-agent along with the GET request
response = requests.get(url='https://www.zhihu.com/explore', headers=headers)
print(response.status_code)  # 200
print(response.text)
with open('zhihu.html', 'w', encoding='utf-8') as f:  # create a zhihu.html file
    f.write(response.text)

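When several requests need the same header, one possible variation is to set it once on a requests.Session. The sketch below is an illustration rather than part of the original talk: make_session and fetch are hypothetical helper names, and the timeout of 10 seconds is an arbitrary choice. Nothing is sent over the network here; the request is only prepared so the merged header can be inspected.

```python
import requests

# Same kind of browser-like User-Agent as in the example above
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}


def make_session(headers=None):
    """Create a session whose requests all carry the given headers."""
    session = requests.Session()
    session.headers.update(headers or DEFAULT_HEADERS)
    return session


def fetch(session, url, timeout=10):
    """Fetch a page and raise an exception on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


session = make_session()
# Prepare (but do not send) a request to see which headers would go out
prepared = session.prepare_request(
    requests.Request("GET", "https://www.zhihu.com/explore")
)
print(prepared.headers["User-Agent"])
```

Calling fetch(session, "https://www.zhihu.com/explore") would then send the request with that header. Whether the site accepts it still depends on its anti-crawling rules.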

That's as far as text scraping goes here; anything deeper is beyond me for now.

Next, let's look at scraping images.

# Download a single image
import requests

# URL of one image
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1596560238680&di=0b9fd395e131fc1df9d992b1c33f3c70&imgtype=0&src=http%3A%2F%2Ft7.baidu.com%2Fit%2Fu%3D3616242789%2C1098670747%26fm%3D79%26app%3D86%26f%3DJPEG%3Fw%3D900%26h%3D1350'
response = requests.get(url)
# response.content holds the raw bytes of the image
img = response.content
# Save it to a local file; 'w' = write, 'b' = binary, so 'wb' writes binary data
with open('./a.jpg', 'wb') as f:
    f.write(img)


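This is a simple image download, but it fetches only one image at a time. Changing the URL and rerunning it for every image would be slower than downloading by hand.

That is where batch downloading comes in: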

import re
import requests


def download(html):
    # Extract the image URLs with a regular expression
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    for key in pic_url:
        print("Downloading image: " + key + "\r\n")
        try:
            pic = requests.get(key, timeout=10)
        except requests.exceptions.ConnectionError:
            print('Image could not be downloaded')
            continue
        # Path to save the image under ('save_path' is a placeholder prefix; replace it with your own)
        file_path = 'save_path' + str(i) + '.jpg'
        with open(file_path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


def main():
    url = "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1596723403215_R&pv=&ic=0&nc=1&z=&hd=&latest=&copyright=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&sid=&word=%E5%88%98%E4%BA%A6%E8%8F%B2"
    result = requests.get(url)
    download(result.text)


if __name__ == '__main__':
    main()

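The regular expression is the core of the batch download, so here is a small sketch of it in isolation, run on a made-up string instead of a live page. The SAMPLE text, the extract_image_urls and numbered_filenames helpers, and the img_ filename prefix are assumptions for illustration; only the "objURL" pattern comes from the code above.

```python
import re

# Non-greedy capture of everything between "objURL":" and the next ",
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",', re.S)


def extract_image_urls(html):
    """Return every objURL value found in the page source, in order."""
    return OBJ_URL_PATTERN.findall(html)


def numbered_filenames(urls, prefix="img_"):
    """Pair each URL with a numbered .jpg filename, starting from 1."""
    return [(f"{prefix}{i}.jpg", url) for i, url in enumerate(urls, start=1)]


# Made-up fragment imitating the embedded data the pattern is written for
SAMPLE = (
    '{"objURL":"http://example.com/a.jpg","width":900},'
    '{"objURL":"http://example.com/b.jpg","width":600}'
)

urls = extract_image_urls(SAMPLE)
plan = numbered_filenames(urls)
for filename, url in plan:
    print(filename, "<-", url)
```

The non-greedy (.*?) matters: a greedy (.*) would run from the first "objURL" to the last quote-comma pair in the page and return one oversized match instead of one URL per image.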

That's it for image scraping.

Now let's combine the two and scrape both text and images. Let's scrape a Douban movie ranking.

import json
import os
import requests
from bs4 import BeautifulSoup

# Request header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
# Request URL
url = "https://movie.douban.com/ithil_j/activity/widget/987"
req = requests.get(url=url, headers=headers)
# Convert the API response to a dict. json.loads decodes \uXXXX escapes itself,
# so the separate encode('utf8').decode('unicode_escape') step is dropped:
# it can corrupt escaped quotes and non-ASCII text.
dict_req = json.loads(req.text)
# Take out the list of subjects
sub_list = dict_req['res']['subjects']
# Index the subjects by position
dd = {i: v for i, v in enumerate(sub_list)}
for j in dd.keys():
    title = dd[j]['title']
    cover = dd[j]['cover']
    rating = dd[j]['rating']
    url_sub = dd[j]['url']
    req_cover = requests.get(url=cover)
    req_rating = requests.get(url=url_sub, headers=headers)
    # req_cover.content holds the raw bytes of the cover image
    img = req_cover.content
    # Create the folders
    b = os.getcwd()  # current working directory of the process
    path = b + "/movie/" + str(rating) + title
    ispath = os.path.exists(path)  # check whether the directory already exists
    if not ispath:
        os.makedirs(path)  # create the movie directory
        os.makedirs(path + "/" + title + "演员")  # "演员" = cast; holds the actor photos
    img_path = os.path.exists(path + "/" + title + '.jpg')
    # Save the cover locally; 'wb' writes binary data
    if not img_path:
        with open(path + "/" + title + '.jpg', 'wb') as f:
            f.write(img)
    # Find the tag that holds the synopsis on the movie page
    bf = BeautifulSoup(req_rating.text, "html.parser")
    text2 = bf.find('span', property='v:summary')
    # Write the synopsis to a text file ("简介" = synopsis), if the tag was found
    if text2 is not None:
        with open(path + "/" + title + '-简介.txt', mode='w', encoding='utf-8') as file_handle:
            file_handle.write(text2.text)
    movie_title = bf.find_all('li', class_='celebrity')
    for m in movie_title:
        # movie_img = bf.findAll('div', class_='avatar')
        div_txt = m.find('div')
        url_movie = div_txt['style']
        # Drop the first 22 characters and the last one; this assumes the style
        # has the form "background-image: url(...)"
        url_mov = url_movie[22:-1]
        mov_img = requests.get(url_mov).content
        a_txt = m.find('a')
        name = a_txt['title']
        with open(path + "/" + title + "演员/" + name + '.jpg', 'wb') as f:
            f.write(mov_img)

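The JSON handling and the folder naming can be tried offline. The sketch below is only an illustration under assumptions: the sample payload is invented (its field names title, rating, cover, and url follow the code above, but the values and the rating type are made up), and movie_dir is a hypothetical helper that builds the same <base>/movie/<rating><title> layout with pathlib instead of string concatenation.

```python
import json
from pathlib import Path


def movie_dir(base, subject):
    """Return the folder for one movie: <base>/movie/<rating><title>."""
    return Path(base) / "movie" / f"{subject['rating']}{subject['title']}"


# Invented payload shaped like the 'res' -> 'subjects' structure used above
sample = {
    "res": {
        "subjects": [
            {
                "title": "霸王别姬",
                "rating": 9.6,
                "cover": "https://example.com/cover.jpg",
                "url": "https://example.com/movie/1",
            }
        ]
    }
}

# json.dumps escapes non-ASCII text as \uXXXX by default, imitating an API
# response that carries Chinese titles as Unicode escapes
raw = json.dumps(sample)

# json.loads turns the \uXXXX escapes back into the original characters
subjects = json.loads(raw)["res"]["subjects"]
for subject in subjects:
    print(subject["title"], "->", movie_dir(".", subject))
```

Because json.loads already restores the escaped characters, no manual unicode_escape decoding is needed before parsing.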

That wraps up this short share.

Posted in: Python crawlers

2022-10-28
