How Much Do You Know About Web Crawling?


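Part 1: Scraping the Zen of Python
A typical web crawler works through the following steps:
1. Visit the site
2. Find the information you need and pin down its location
3. Retrieve the information and process it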
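
The three steps above can be sketched as a small pipeline. This is only an illustration: `crawl`, `fake_fetch`, `locate_h1`, and the sample HTML are hypothetical names and data that do not appear in the original post, and the fetch step is passed in as a function so the sketch runs without network access.

```python
def crawl(url, fetch, locate, process):
    """Run the three crawler steps in order."""
    page = fetch(url)      # 1. visit the site
    raw = locate(page)     # 2. find and locate the information
    return process(raw)    # 3. process what was found


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request; returns canned HTML.
    return "<html><body><h1>  Hello, crawler!  </h1></body></html>"


def locate_h1(page):
    # Slice out the text between <h1> and </h1> using str.find.
    start = page.find("<h1>") + len("<h1>")
    end = page.find("</h1>")
    return page[start:end]


result = crawl("https://example.com/", fake_fetch, locate_h1, str.strip)
print(result)
```

In a real crawler the `fetch` argument would be a function that performs the HTTP request; the other two steps stay the same.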

Here is the code:

import requests

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
text = res.text
text  # in an interactive session, this displays the page source

Take a look at the result:
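As you can see, the response is essentially the content shown under Elements in the browser's developer tools, only as a string. Next, we use Python's built-in string method find to locate the index of the Zen of Python and slice it out of the string.
Looking at the page, we can see that this passage sits in a special container. Inspecting the element (the shortcut Ctrl+Shift+C jumps straight to it) shows that the passage is wrapped in a pre tag, so we can use that feature together with find to pull out the content.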

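# Save the scraped content to a text file
with open('zon_of_python.txt', 'w') as f:
    # The +28 skips the opening tag: the text we want starts 28 characters after '<pre'
    f.write(text[text.find('<pre')+28:text.find('</pre>')])

# Print the same slice, minus its final character
print(text[text.find('<pre')+28:text.find('</pre>')-1])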
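
The fixed offset of 28 only holds while the opening tag keeps exactly the same attributes. As a hedged alternative, the sketch below looks for the `>` that closes the opening tag instead of counting characters. The helper name `extract_pre` and the sample markup are assumptions made for illustration; the sample is not the real python.org page.

```python
def extract_pre(html):
    """Return the text inside the first <pre>...</pre> element, or '' if absent."""
    open_start = html.find('<pre')
    if open_start == -1:
        return ''
    # The opening tag may carry attributes, so skip ahead to its closing '>'.
    open_end = html.find('>', open_start)
    close_start = html.find('</pre>', open_end)
    if open_end == -1 or close_start == -1:
        return ''
    return html[open_end + 1:close_start]


# Hypothetical sample markup for demonstration only.
sample = '<div><pre class="literal-block">\nBeautiful is better than ugly.\n</pre></div>'
zen = extract_pre(sample).strip()
print(zen)
```

Because `str.find` returns -1 when nothing matches, the function checks for that case rather than slicing with a bogus index.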

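Next, we use iCIBA (金山词霸) to translate the Zen of Python we just scraped.
We start with iCIBA because Youdao Translate, Baidu Translate, and Google Translate all protect their requests with encryption; you can try those on your own later.
First, open the iCIBA home page: http://www.iciba.com/
Then open the Network tab in the developer tools and translate some text, for example the first line we just scraped: "Beautiful is better than ugly."
After clicking Translate, a new entry appears under Name whose request method is POST. Click Preview and you will see that its data contains the translation we want.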

import requests


def translate(word):
    url = 'http://fy.iciba.com/ajax.php?a=fy'
    data = {
        'f': 'auto',
        't': 'auto',
        'w': word,
    }
    # The user-agent tells the server which tool is making the request.
    # Requests that identify as a crawler are usually refused; browser requests get a response.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
    }
    response = requests.post(url, data=data, headers=headers)  # send the request
    json_data = response.json()  # parse the JSON response
    return json_data


def run(word):
    result = translate(word)['content']['out']
    print(result)
    return result


def main():
    with open('zon_of_python.txt') as f:
        # Strip the trailing newline and skip blank lines before translating
        zh = [run(line.strip()) for line in f if line.strip()]
    with open('zon_of_python_zh-CN.txt', 'w', encoding='utf-8') as g:
        for i in zh:
            g.write(i + '\n')


if __name__ == '__main__':
    main()
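
The code above indexes `['content']['out']` directly, which raises an error if the service ever returns a different shape. The sketch below shows one cautious way to read that field. The helper name `extract_translation`, the `status` field, and both sample bodies are assumptions for illustration; only the `content` and `out` keys come from the code above.

```python
import json


def extract_translation(json_data):
    """Return the translated text, or None if the expected keys are missing."""
    content = json_data.get('content')
    if not isinstance(content, dict):
        return None
    return content.get('out')


# Hypothetical response bodies, shaped like the one the code above indexes.
ok = json.loads('{"status": 1, "content": {"out": "translated text"}}')
bad = json.loads('{"status": 0, "content": []}')

print(extract_translation(ok))
print(extract_translation(bad))
```

Returning None lets the caller decide whether to skip the line, retry, or stop, instead of crashing in the middle of the file.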

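Part 2: Scraping Movie Titles and Posters from the Douban Top 250
When we open https://movie.douban.com/top250, we find that each page lists only 25 movies. To scrape all 250 we have to page through the list, so we need to see how the URL changes from one page to the next.
'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
This shows how the Douban URL changes between pages: only the start value differs.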
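
As a quick sketch of that pattern, the start value can be stepped by 25 to produce all ten page URLs. The helper name `page_urls` and its parameters are hypothetical; the URL template is the one shown above.

```python
def page_urls(total=250, per_page=25):
    """Build the list-page URLs by stepping the start parameter."""
    base = 'https://movie.douban.com/top250?start={}&filter='
    return [base.format(offset) for offset in range(0, total, per_page)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```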
We also need to check where the information we want sits in the page.
Without further ado, here is the code:

import os

import requests

# Create the output folder for the poster images
if not os.path.exists('image'):
    os.mkdir('image')


def parse_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    text = res.text
    item = []
    for i in range(25):
        # Jump to just past the next 'alt', then pull out the title and image URL
        text = text[text.find('alt')+3:]
        item.append(extract(text))
    return item


def extract(text):
    # Split on double quotes: index 1 is the alt value (movie title),
    # index 3 is the next quoted value (assumed to be the image src)
    text = text.split('"')
    name = text[1]
    image = text[3]
    return name, image


def write_movies_file(item, stars):
    print(item)
    with open('douban_film.txt', 'a', encoding='utf-8') as f:
        f.write('Rank: %d\t Title: %s\n' % (stars, item[0]))
    # Download the poster and save it under the movie title
    r = requests.get(item[1])
    with open('image/' + str(item[0]) + '.jpg', 'wb') as f:
        f.write(r.content)


def main():
    stars = 1  # running rank, 1 through 250
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            write_movies_file(item, stars)
            stars += 1


if __name__ == '__main__':
    main()
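
Splitting on `alt` and on double quotes depends on the exact attribute order in the page. As a hedged alternative, the sketch below uses the standard library's `html.parser` to collect the `alt` and `src` attributes of each `img` tag. The class name `PosterParser` and the sample markup are assumptions for illustration; the real Douban page may be structured differently and may contain other images that need filtering.

```python
from html.parser import HTMLParser


class PosterParser(HTMLParser):
    """Collect (alt, src) pairs from every <img> tag that has both attributes."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr_map = dict(attrs)
        if attr_map.get('alt') and attr_map.get('src'):
            self.items.append((attr_map['alt'], attr_map['src']))


# Hypothetical markup for demonstration only.
sample = '''
<div class="pic"><img width="100" alt="Movie A" src="https://example.com/a.jpg"></div>
<div class="pic"><img width="100" alt="Movie B" src="https://example.com/b.jpg"></div>
'''

parser = PosterParser()
parser.feed(sample)
print(parser.items)
```

With this approach, `parser.items` could take the place of the list that `parse_html` returns, since each entry is also a (name, image) pair.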

Here is the output:
OK, and with that, the job is done!


Posted in: Python Crawlers

2022-10-24
