A Simple Introduction to Web Crawlers


1. Basics

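import requests                  # sends HTTP requests, as a browser would
from bs4 import BeautifulSoup    # parses HTML text into a searchable object

response = requests.get(url)     # url: the address of the page to fetch
bs4 = BeautifulSoup(response.text, 'html.parser')

# Searching
bs4.find(name='tag_name', attrs={'attr_name': 'attr_value'})      # first match, or None
bs4.find_all(name='tag_name', attrs={'attr_name': 'attr_value'})  # every match, as a list

# Reading the response body
# response.content: the raw bytes, used for binary data (images, video)
# response.text: the body decoded to a string, using response.encoding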
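To see how find and find_all behave without touching the network, the sketch below parses a small hand-written HTML string. The markup, the id "articles", and the example.com links are made up for illustration; only the BeautifulSoup calls themselves come from the notes above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (response.text).
html = """
<div id="articles">
  <ul>
    <li><h3>First title</h3><a href="//example.com/1.html">read</a></li>
    <li><a href="//example.com/ad.html">no heading here</a></li>
    <li><h3>Second title</h3><a href="//example.com/2.html">read</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag.
container = soup.find(name="div", attrs={"id": "articles"})

# find_all returns every matching tag below the container.
items = container.find_all(name="li")

titles = []
for li in items:
    h3 = li.find(name="h3")
    if h3 is None:
        # find returns None when nothing matches, so skip this item.
        continue
    titles.append(h3.text)

# Searching for a tag that does not exist also yields None.
missing = soup.find(name="table")

print(titles)
```

Because find returns None on a miss, it is worth checking the result before calling .text or .get on it, as the loop does here.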

2. Example

import os

import requests
from bs4 import BeautifulSoup

# Directory where the downloaded images are saved
path = os.path.join(os.getcwd(), 'img')
os.makedirs(path, exist_ok=True)

# 1. Send the request, as a browser would
response = requests.get("……")
response.encoding = 'gbk'

# 2. response.text now holds the page's HTML
# print(response.text)

# 3. Use bs4 to parse the HTML into an object
bs4 = BeautifulSoup(response.text, 'html.parser')
div = bs4.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li_list = div.find_all(name='li')

for li in li_list:
    print('=' * 120)

    # Skip list items that have no heading
    h3 = li.find(name='h3')
    if not h3:
        continue
    print(h3.text)

    a = li.find(name='a')
    href = a.get('href')
    print('https:{}'.format(href))

    img = li.find(name='img')
    src = img.get('src')
    src = 'https:{}'.format(src)
    print(src)

    # The file name is the last segment of the image URL
    file_name = src.rsplit('/', maxsplit=1)[1]
    file_path = os.path.join(path, file_name)

    # src is only an address, so send another GET request to fetch the image
    ret = requests.get(src)

    # ret.content is the raw bytes; write them to disk to save the image
    with open(file_path, 'wb') as f:
        f.write(ret.content)
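The example builds absolute URLs with 'https:{}'.format(...) and takes the file name with rsplit, which works as long as every link starts with '//' and carries no query string. A slightly more forgiving variant, sketched here with only the standard library, could look like the following. The helper names absolute_url and file_name_from_url and the example.com addresses are hypothetical and not part of the original script.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_url(page_url, link):
    """Resolve a link found on page_url into an absolute URL.

    Handles protocol-relative ('//host/x'), root-relative ('/x'),
    and already absolute links.
    """
    return urljoin(page_url, link)


def file_name_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical page and links, for illustration only.
page = "https://example.com/news/"
img_url = absolute_url(page, "//img.example.com/a/b/pic.jpg?size=200")

print(img_url)
print(file_name_from_url(img_url))
```

In the loop above, src = absolute_url(page_url, img.get('src')) would stand in for the format call, where page_url is whatever address was passed to requests.get.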


Posted in: Python Crawlers

2022-11-01
