Scraping Multiple Different Web Pages with a Crawler

Task 4

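'''
This task asks you to complete a simple scraper project, covering page fetching, information extraction, and data saving.
While working on it, think carefully and follow your own reasoning to complete the task.
Note: the score for this task combines the order of submission with the correctness of the result.
Because every student has a different assignment, do not copy; any copying found will result in a score of 0 for this task.'''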
from typing import Any, Tuple

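'''
Question 1: Use scraping techniques to fetch the pages at the following 5 URLs and extract the key information.
Extract these 4 kinds of information from the fetched page source:
1. Article title
2. Body text (note: extract only the article's text, not unrelated text elsewhere on the page)
3. Image links (if any)
4. Time and date (if any)'''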
import requests
from bs4 import BeautifulSoup

# Your assigned URLs, one variable per page
url1 = 'http://fashion.cosmopolitan.com.cn/2019/1020/287733.shtml'
url2 = 'http://dress.pclady.com.cn/style/liuxing/1003/520703.html'
url3 = 'http://www.smartshe.com/trends/20191009/56414.html'
url4 = 'https://dress.yxlady.com/202004/1560779.shtml'
url5 = 'http://www.yoka.com/fashion/roadshow/2019/0513/52923401100538.shtml'
# Browser-like User-Agent sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
}
def get_url1(url, data=None):
    # Fetch the page passed in as `url` (the original always fetched url1)
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('1: Article title', title)
    # get_text() is used because .string is None for tags with nested children
    texts = [p.get_text() for p in soup.find_all(class_='p2')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    return {'title': title, 'text': texts, 'images': images, 'time': times}
def get_url2(url, data=None):
    # Fetch the page passed in as `url` (the original always fetched url2)
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('2: Article title', title)
    texts = [node.get_text() for node in soup.find_all(class_='artText')]
    images = [img.get('src') for img in soup.find_all('img')]
    # get_text() is used because .string is None for tags with nested children
    times = [t.get_text() for t in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    return {'title': title, 'text': texts, 'images': images, 'time': times}
def get_url3(url, data=None):
    # Fetch the page passed in as `url` (the original always fetched url3)
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'  # the page is UTF-8 encoded, so decode it as UTF-8
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('3: Article title', title)
    texts = [node.get_text() for node in soup.find_all(class_='art-body')]
    images = [img.get('src') for img in soup.find_all('img')]
    # The first <span> inside .art-auther holds the publication time
    times = [t.get_text() for t in soup.select('.art-auther > span:nth-child(1)')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    return {'title': title, 'text': texts, 'images': images, 'time': times}
def get_url4(url, data=None):
    # Fetch the page passed in as `url` (the original always fetched url4)
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('4: Article title', title)
    # Body paragraphs live under .left1 > div.ArtCon
    texts = [p.get_text() for p in soup.select('.left1 > div.ArtCon > p')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.select('#acxc > span')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    return {'title': title, 'text': texts, 'images': images, 'time': times}
def get_url5(url, data=None):
    # Fetch the page passed in as `url` (the original always fetched url5)
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('5: Article title', title)
    texts = [node.get_text() for node in soup.find_all(class_='textCon')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [t.get_text() for t in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    return {'title': title, 'text': texts, 'images': images, 'time': times}

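import json

# Call each function with its own URL and collect whatever it returns
data = [get_url1(url1), get_url2(url2), get_url3(url3), get_url4(url4), get_url5(url5)]
with open("record.json", 'w', encoding='utf-8') as f:
    # json.dump writes valid JSON; ensure_ascii=False keeps Chinese text readable
    json.dump(data, f, ensure_ascii=False, indent=2)
print("Finished writing to file...")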
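
Each function above sets the response encoding by hand, because decoding GBK bytes with the wrong codec garbles the Chinese text. As an alternative, the charset can often be read from the page's own `<meta>` declaration. The following is only a minimal sketch using the standard library: `decode_page`, `CHARSET_RE`, and the sample bytes are hypothetical names made up for illustration, and it assumes the page declares its charset near the top of the HTML.

```python
import re

# Matches charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)


def decode_page(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw page bytes using the charset declared in the HTML, if any."""
    match = CHARSET_RE.search(raw[:2048])  # the declaration usually sits near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # the page named a codec Python does not know
        return raw.decode(fallback, errors="replace")


# Made-up page bytes, encoded as GBK like four of the five sites above.
sample = '<meta charset="gbk"><title>时尚</title>'.encode("gbk")
print(decode_page(sample))                        # decoded with the declared charset
print(sample.decode("utf-8", errors="replace"))   # the wrong codec garbles the title
```

With `requests`, the raw bytes are available as `resp.content`, so `decode_page(resp.content)` could stand in for setting `resp.encoding` manually in each function.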

The way the functions are called at the end is still rough, so please treat this code as a reference only.
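
One possible way to tidy up the final calls is to pair each function with its URL, call them in a loop, skip any page that fails, and write the collected results as JSON. This is a sketch under assumptions, not the author's solution: `run_all` is a hypothetical helper, it assumes every scraper function returns its extracted data, and the two stand-in scrapers and `example.com` URLs exist only so the example runs without network access.

```python
import json


def run_all(jobs, out_path="record.json"):
    """Call each scraper with its URL, skip failures, and save results as JSON.

    `jobs` is a list of (function, url) pairs.
    """
    results = []
    for func, url in jobs:
        try:
            record = func(url)
        except Exception as exc:  # one broken page should not stop the others
            print(f"failed: {url}: {exc}")
            continue
        results.append({"url": url, "data": record})
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the output file
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results


# Stand-in scrapers (hypothetical) so the sketch runs offline.
def fake_ok(url):
    return {"title": "示例标题"}


def fake_broken(url):
    raise ValueError("page layout changed")


saved = run_all(
    [(fake_ok, "http://example.com/a"), (fake_broken, "http://example.com/b")],
    out_path="record_demo.json",
)
print(saved)
```

With the real functions, the call would look like `run_all([(get_url1, url1), (get_url2, url2), ...])`, provided each function returns its data instead of only printing it.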

Copyright notice: published by Python教程 on 2022-10-25, 3705 characters in total.