A Simple Python Web Crawler Example


Preparation: install a Python 3 environment and the beautifulsoup4 library (https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id7)

from urllib import request

# Fetch the page and print the body decoded as UTF-8
resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode("utf-8"))

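To imitate a real browser, send a User-Agent header with the request. (This keeps the server from treating the request as a crawler; without the browser information, the request may fail with an error.)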

req = request.Request(url)  # url is the address to fetch
req.add_header(key, value)  # key is "User-Agent"; value is the browser's version string
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
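As a concrete illustration of the step above, here is a minimal sketch that attaches a User-Agent header to a request. The URL and the User-Agent string are placeholder assumptions; substitute the string your own browser reports. The request is only built here, not sent.

```python
from urllib import request

# Assumed placeholder values: any URL and any real browser's User-Agent string will do.
url = "http://www.baidu.com"
user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

req = request.Request(url)
req.add_header("User-Agent", user_agent)

# urllib stores header names via str.capitalize(), so the lookup key is "User-agent".
print(req.get_header("User-agent"))

# To actually send the request:
# resp = request.urlopen(req)
# print(resp.read().decode("utf-8"))
```

Building the Request object separately from sending it makes it easy to inspect the headers before any traffic goes out.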

Import parse from the urllib package:

from urllib import parse

Build the POST data with urlencode:

postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])

Send the POST request:

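resp = request.urlopen(req, data=postData.encode("utf-8"))  # send the POST request with postData as the body
resp.status  # HTTP status code of the response
resp.reason  # reason phrase returned by the server, e.g. "OK"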
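To make the encoding step concrete, the sketch below shows what urlencode produces and how passing data turns the request into a POST. The form fields and the endpoint are hypothetical, and nothing is sent over the network.

```python
from urllib import parse, request

# Hypothetical form fields and endpoint, for illustration only.
post_data = parse.urlencode([
    ("username", "alice"),
    ("page", "1"),
    ("q", "python crawler"),
])
print(post_data)  # username=alice&page=1&q=python+crawler

# urlopen expects bytes for the request body, hence the encode().
body = post_data.encode("utf-8")
req = request.Request("http://example.com/login", data=body)
print(req.get_method())  # POST

# To actually send it:
# resp = request.urlopen(req)
# print(resp.status, resp.reason)
```

urllib switches from GET to POST automatically once a request carries data, so no explicit method argument is needed.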

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import ssl

# Fetch entry links from Wikipedia
# Disable certificate verification globally (insecure)
ssl._create_default_https_context = ssl._create_unverified_context

# Request the URL and decode the response as UTF-8
html = urlopen("https://en.wikipedia.org/wiki/Main_Page").read().decode("utf-8")
# Parse it with BeautifulSoup
soup = bs(html, "html.parser")
# print(soup)

# Get all <a> tags whose href attribute starts with "/wiki/Special"
urllist = soup.find_all("a", href=re.compile("^/wiki/Special"))
for url in urllist:
    # Skip links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        # get_text() returns all text under the tag, including text in child tags;
        # .string returns a single string, or None when the tag contains more than one child
        print(url.get_text() + "----->" + url["href"])
        # print(url)
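The two regular expressions used for filtering can be tried on their own, without any network access. This is a minimal sketch using only the standard re module; the href values are made up for illustration. An unescaped `.` in a pattern matches any character, so the sketch writes the image suffix pattern with `\.`.

```python
import re

# Made-up href values; a real crawl would read them from the <a> tags.
hrefs = [
    "/wiki/Special:Random",
    "/wiki/Special:Search",
    "/wiki/Special:Example.jpg",
    "/wiki/Special:Example.JPG",
    "/wiki/Main_Page",
    "/wiki/Special:Notjpg",
]

special = re.compile(r"^/wiki/Special")  # href must start with /wiki/Special
image = re.compile(r"\.(jpg|JPG)$")      # href ends in .jpg or .JPG (literal dot)

kept = [h for h in hrefs if special.search(h) and not image.search(h)]
print(kept)
# ['/wiki/Special:Random', '/wiki/Special:Search', '/wiki/Special:Notjpg']
```

With an unescaped dot, "/wiki/Special:Notjpg" would also be discarded, because `.` would match the "t" before "jpg".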

Install pymysql via pip:
$ pip install pymysql
Or from the source package:
$ python setup.py install

# Import the package
import pymysql.cursors

# Open the database connection
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="123456",
                             db="wikiurl",
                             charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Build the SQL statement
        sql = "insert into `tableName`(`urlname`,`urlhref`) values(%s,%s)"
        # Execute it; url is an <a> tag taken from the crawl loop
        cursor.execute(sql, (url.get_text(), "https://en.wikipedia.org" + url["href"]))
    # Commit the transaction
    connection.commit()
finally:
    # Close the connection
    connection.close()
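The same parameterized-insert pattern can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module purely for illustration; it is not pymysql. The table name and the sample rows are assumptions, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# Assumed sample data standing in for (link text, href) pairs from a crawl.
links = [
    ("Random article", "/wiki/Special:Random"),
    ("Search", "/wiki/Special:Search"),
]

connection = sqlite3.connect(":memory:")  # throwaway in-memory database
try:
    cursor = connection.cursor()
    cursor.execute("create table urls (urlname text, urlhref text)")

    # Values are passed separately from the SQL text, never formatted into it.
    sql = "insert into urls (urlname, urlhref) values (?, ?)"
    rows = [(name, "https://en.wikipedia.org" + href) for name, href in links]
    cursor.executemany(sql, rows)
    connection.commit()

    cursor.execute("select urlname, urlhref from urls order by urlname")
    stored = cursor.fetchall()
    print(stored)
finally:
    connection.close()
```

Passing the values as a separate tuple lets the driver handle quoting, which avoids SQL injection from scraped text.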

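The Robots protocol (also called the crawler protocol) is formally the "Robots Exclusion Protocol". A website uses it to tell search engines which pages may be crawled and which may not. The file usually sits at the site root, for example https://en.wikipedia.org/robots.txt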

Disallow: access is not permitted
Allow: access is permitted
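A crawler can check these rules before requesting a page. The following is a minimal sketch using the standard library's urllib.robotparser; the rules and the crawler name "MyCrawler" are made up for illustration, and a real crawler would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
rules = """\
User-agent: *
Allow: /wiki/
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "MyCrawler" is a hypothetical user agent name.
allowed = parser.can_fetch("MyCrawler", "https://en.wikipedia.org/wiki/Main_Page")
blocked = parser.can_fetch("MyCrawler", "https://en.wikipedia.org/w/index.php")
print(allowed, blocked)  # True False
```

To check a live site instead, call parser.set_url(...) with the robots.txt address and then parser.read() before can_fetch.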

Reference: imooc course https://www.imooc.com/learn/712

September 27, 2018


Posted in: Python crawlers

2022-10-28


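1. urllib and BeautifulSoup

Getting browser information

Imitating a real browser: send a User-Agent header

Using POST

Complete code example (crawling the links on the Wikipedia main page)

2. Storing data in MySQL

Installing pymysql

Usage

3. Notes on crawling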
