Scraping Several Different Web Pages with a Crawler


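'''
This task asks you to complete a simple crawler project covering page fetching, information extraction, and data saving.
Think it through carefully and apply your own reasoning as you complete the task.
Note: the score for this task combines submission order with correctness.
Because each student's assignment is different, do not copy others' work; any submission found to be copied receives a score of 0.'''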
from typing import Any, Tuple

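'''
Question 1: Use crawler techniques to fetch the pages at the following 5 URLs and extract the key information.
Extract these 4 kinds of information from the fetched page source:
1. Article title
2. Body text (note: extract only the article's own text, not other unrelated text on the page)
3. Image links (if any)
4. Time and date (if any)'''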
import requests
from bs4 import BeautifulSoup

# The URLs assigned to you: url = [url1, url2, url3, url4, url5]
url1 = 'http://fashion.cosmopolitan.com.cn/2019/1020/287733.shtml'
url2 = 'http://dress.pclady.com.cn/style/liuxing/1003/520703.html'
url3 = 'http://www.smartshe.com/trends/20191009/56414.html'
url4 = 'https://dress.yxlady.com/202004/1560779.shtml'
url5 = 'http://www.yoka.com/fashion/roadshow/2019/0513/52923401100538.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
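def get_url1(url, data=None):
    # Fetch the page passed in as `url` (the original ignored the argument and read the global url1).
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # the page is GBK-encoded, so decode the response as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string
    print('1: Article title', title)
    # get_text() also works for tags with nested children, where .string returns None.
    paragraphs = [p.get_text() for p in soup.find_all(class_='p2')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.find_all(class_='time')]
    # Loop variables use their own names so they no longer overwrite the lists.
    for text in paragraphs:
        print('Body:', text)
    for src in images:
        print('Image:', src)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'url': url, 'title': title, 'body': paragraphs, 'images': images, 'time': times}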
def get_url2(url, data=None):
    # Fetch the page passed in as `url` instead of the global url2.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # the page is GBK-encoded, so decode the response as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string
    print('2: Article title', title)
    paragraphs = [p.get_text() for p in soup.find_all(class_='artText')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.find_all(class_='time')]
    for text in paragraphs:
        print('Body:', text)
    for src in images:
        print('Image:', src)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'url': url, 'title': title, 'body': paragraphs, 'images': images, 'time': times}
def get_url3(url, data=None):
    # Fetch the page passed in as `url` instead of the global url3.
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'  # the page is UTF-8 encoded, so decode the response as UTF-8
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string
    print('3: Article title', title)
    paragraphs = [p.get_text() for p in soup.find_all(class_='art-body')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.select('.art-auther > span:nth-child(1)')]
    for text in paragraphs:
        print('Body:', text)
    for src in images:
        print('Image:', src)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'url': url, 'title': title, 'body': paragraphs, 'images': images, 'time': times}
def get_url4(url, data=None):
    # Fetch the page passed in as `url` instead of the global url4.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # the page is GBK-encoded, so decode the response as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string
    print('4: Article title', title)
    paragraphs = [p.get_text() for p in soup.select('.left1 > div.ArtCon > p')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.select('#acxc > span')]
    for text in paragraphs:
        print('Body:', text)
    for src in images:
        print('Image:', src)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'url': url, 'title': title, 'body': paragraphs, 'images': images, 'time': times}
def get_url5(url, data=None):
    # Fetch the page passed in as `url` instead of the global url5.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # the page is GBK-encoded, so decode the response as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string
    print('5: Article title', title)
    paragraphs = [p.get_text() for p in soup.find_all(class_='textCon')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.find_all(class_='time')]
    for text in paragraphs:
        print('Body:', text)
    for src in images:
        print('Image:', src)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'url': url, 'title': title, 'body': paragraphs, 'images': images, 'time': times}

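import json
data = [get_url1(url1), get_url2(url2), get_url3(url3), get_url4(url4), get_url5(url5)]  # call each scraper with its URL
with open("record.json", 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)  # write valid JSON rather than str(data)
print("Finished writing to file...")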
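
For the saving step, the standard `json` module can write the results and read them back as a quick check. The following is a minimal sketch, assuming each result is a dict; the `save_records` helper, the record keys, and the demo file name are hypothetical and not part of the original task.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write a list of dicts to `path` as UTF-8 JSON and return the parsed-back copy."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        json.dump(records, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical record, used only to show the shape of the data.
sample = [{"title": "示例标题", "body": ["第一段", "第二段"], "images": [], "time": "2019-10-20"}]
demo_path = os.path.join(tempfile.gettempdir(), "record_demo.json")
restored = save_records(sample, demo_path)
print(restored == sample)
```

Reading the file back right after writing it is a cheap way to confirm that the output is valid JSON and that no text was lost to encoding problems.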

The function calls at the end were the part I did not handle well, so treat this code as a reference only.
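
The five functions differ only in the page encoding and the selectors they use, so one possible way to simplify the calls is to keep those differences in a lookup table keyed by hostname. The sketch below uses only the standard library. The encodings and selectors are taken from the five functions above (class names rewritten as CSS selectors), while the `SITE_RULES` layout and the helpers `rule_for` and `absolute_images` are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

# Encodings and selectors come from the five functions; the dict layout is an assumption.
SITE_RULES = {
    "fashion.cosmopolitan.com.cn": {"encoding": "gbk", "body": ".p2", "time": ".time"},
    "dress.pclady.com.cn": {"encoding": "gbk", "body": ".artText", "time": ".time"},
    "www.smartshe.com": {"encoding": "utf-8", "body": ".art-body",
                         "time": ".art-auther > span:nth-child(1)"},
    "dress.yxlady.com": {"encoding": "gbk", "body": ".left1 > div.ArtCon > p",
                         "time": "#acxc > span"},
    "www.yoka.com": {"encoding": "gbk", "body": ".textCon", "time": ".time"},
}


def rule_for(page_url):
    """Return the rule whose key equals the URL's hostname; raises KeyError if unknown."""
    return SITE_RULES[urlparse(page_url).hostname]


def absolute_images(page_url, srcs):
    """Resolve image src values against the page URL, skipping missing ones."""
    return [urljoin(page_url, src) for src in srcs if src]


page = "https://dress.yxlady.com/202004/1560779.shtml"
print(rule_for(page)["encoding"])
# The src values below are made up to show how relative links are resolved.
print(absolute_images(page, ["/images/a.jpg", None, "c.gif"]))
```

With a table like this, a single function could set `response.encoding` from `rule["encoding"]` and pass `rule["body"]` and `rule["time"]` to `soup.select`, and the final step becomes one loop over the URL list instead of five separate calls.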


Posted in: Python crawlers

2022-10-25
