Python Web Scraping in Practice (Part 2): The Requests and re Libraries

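Once the basic concepts of web scraping are familiar, we can move straight on to hands-on work. Starting with Python's requests and re libraries is a quick way to grasp the ideas and workflow behind a Python scraper, and these two libraries alone are enough to build a complete one.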

I. The requests library

1. Introduction

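Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, saves a great deal of work, and covers everyday HTTP testing needs. In day-to-day scraping, the vast majority of HTTP requests to target sites are sent with requests.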

The project's official site sums up its appeal with a tongue-in-cheek tagline: Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

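2. A first test

Let's first try out Requests with a code example. The environment used in this article is Windows 10 + Anaconda3 + Python 3.7.4. Install the library by running the following in a terminal: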

pip install requests

The installation is complete when pip prints a line beginning with "Successfully installed requests".

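The urlopen() function in urllib requests a page with GET by default; the corresponding function in requests is get(). Run the following code in Python:

import requests

# Fetch the source of the Baidu home page with a GET request
res = requests.get("https://www.baidu.com")
# Type of the returned object
print(type(res))
# HTTP status code
print(res.status_code)
# Type of the decoded response body
print(type(res.text))
# First 15 characters of the body
print(res.text[:15])
# Cookies set by the server
print(res.cookies)

The output is:

<class 'requests.models.Response'>
200 # a status code of 200 means the request succeeded
<class 'str'>
<!DOCTYPE html>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>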

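3. Main methods

The requests library has the following seven main methods. A few of the most commonly used ones are introduced below.

Method	Description
requests.request()	Builds and sends a request; the general method underlying the six below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches only the response headers of a page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to a page; corresponds to HTTP POST
requests.put()	Submits a PUT request to a page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to a page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to a page; corresponds to HTTP DELETE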

get() is the method we use most often. The following code sends a GET request to a site:

import requests

res = requests.get("http://httpbin.org/get")
print(res.text)

The output is:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e5dd355-96131363cf818957b8e7b67d"
  },
  "origin": "171.112.101.74",
  "url": "http://httpbin.org/get"
}

The output shows that the response includes the request headers, the URL, and the client IP. To send query parameters with a GET request, set the params argument:

import requests

data = {
    'building': "zhongyuan",
    'nature': "administrative"
}
res = requests.get("http://httpbin.org/get", params=data)
print(res.text)

The output is:

{
  "args": {
    "building": "zhongyuan",
    "nature": "administrative"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e5dd4db-f0568e98f3350cae8998968a"
  },
  "origin": "171.112.101.74",
  "url": "http://httpbin.org/get?building=zhongyuan&nature=administrative"
}
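To see where the query string in the "url" field comes from, the following sketch rebuilds it with the standard library's urllib.parse.urlencode. It only approximates what requests does with params (requests also handles cases such as list values and URLs that already carry a query string), and the variable names are illustrative.

```python
from urllib.parse import urlencode

base_url = "http://httpbin.org/get"
data = {
    "building": "zhongyuan",
    "nature": "administrative",
}

# Encode the dictionary as key=value pairs joined by "&"
query = urlencode(data)
# Append the query string to the base URL after a "?"
full_url = f"{base_url}?{query}"
print(full_url)
```

The printed URL has the same shape as the "url" field echoed back by httpbin.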

As shown above, the parameters were passed in the GET request successfully. In addition, the response body is not just a string: it is JSON-formatted text, so we can parse it into a Python dictionary with the response's json() method:

import requests

res = requests.get("http://httpbin.org/get")
print(type(res.text))
print(res.json())
print(type(res.json()))

The output is:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-5e5dd5e5-b195baec1c11b51c03eee96c'}, 'origin': '171.112.101.74', 'url': 'http://httpbin.org/get'}
<class 'dict'>
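Calling json() on a response is roughly equivalent to passing its text to json.loads from the standard library. The sketch below shows that conversion offline; the text variable is a shortened, hard-coded stand-in for res.text, not a live response.

```python
import json

# A shortened, hard-coded stand-in for res.text (sample data, not a live response)
text = '{"args": {}, "origin": "171.112.101.74", "url": "http://httpbin.org/get"}'

parsed = json.loads(text)   # str -> dict, roughly what res.json() does
print(type(text))
print(type(parsed))
print(parsed["url"])        # fields are now reachable by key
```

Once the body is a dictionary, individual fields can be read by key instead of being searched for in the raw string.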

To make an HTTP request sent by Requests look like it comes from a browser, we usually use the headers keyword argument, which is also a dictionary. The following code shows how:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}
res = requests.get("http://httpbin.org/get", headers=headers)
print(res.text)

The output is shown below. The "User-Agent" header has changed and no longer exposes python-requests, which can help with sites that restrict scrapers based on the User-Agent.

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e5dd68c-e185889878974c88aff8d704"
  },
  "origin": "171.112.101.74",
  "url": "http://httpbin.org/get"
}

The data parameter is normally used with POST requests. To submit form data with POST, pass a dictionary to data and requests will form-encode it. To send a JSON body instead, pass the dictionary to the json parameter.

import requests

data = {
    'user': 'admin',
    'pass': 'admin'
}
res = requests.post('http://httpbin.org/post', data=data)
print(res.text)

The result is:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "pass": "admin",
    "user": "admin"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e5dd775-932576d4bdcad64891fb54fa"
  },
  "json": null,
  "origin": "171.112.101.74",
  "url": "http://httpbin.org/post"
}
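The difference between a form-encoded body and a JSON body can be seen without sending anything. The sketch below builds both with the standard library; it approximates the bodies requests would produce for data= and json= and is not the library's own code.

```python
import json
from urllib.parse import urlencode

data = {"user": "admin", "pass": "admin"}

# Body sent for data=data: form-encoded key=value pairs
form_body = urlencode(data)
# Approximate body sent for json=data: a JSON document
json_body = json.dumps(data)

print(form_body, len(form_body))
print(json_body)
```

The form-encoded body is 21 characters long, which matches the "Content-Length": "21" header in the output above. With json=, requests sets the Content-Type header to application/json instead of application/x-www-form-urlencoded.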

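When sending requests through a proxy, requests often fail because the proxy has stopped working. We therefore set a timeout on each request; when a request times out, we switch to another proxy and try again.

import requests

url = "http://httpbin.org/get"
# A single value applies to both the connect and the read timeout (3 s each)
res = requests.get(url, timeout=3)
# Pass a tuple to set the connect timeout and the read timeout separately
res = requests.get(url, timeout=(3, 30))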
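The "time out, switch proxy, try again" idea can be sketched as a small loop. Everything named below (fetch_with_proxy_rotation, send_with_requests, fake_send, the proxy addresses) is hypothetical and introduced only for illustration. The sender is passed in as a function so that the rotation logic can be tried offline with a fake sender.

```python
def fetch_with_proxy_rotation(url, proxies, send, timeout=(3, 30), retry_on=(Exception,)):
    """Try each proxy in turn and return the first successful response, or None."""
    for proxy in proxies:
        try:
            return send(url, proxy, timeout)
        except retry_on as exc:
            # This proxy failed or timed out: fall through to the next one
            print(f"proxy {proxy} failed ({exc!r}); switching proxy")
    return None


def send_with_requests(url, proxy, timeout):
    """Real sender: one GET through the given proxy (needs network access)."""
    import requests  # imported here so the rotation logic can be tried offline
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)


# Offline demonstration with a fake sender whose first proxy always times out
def fake_send(url, proxy, timeout):
    if proxy == "http://10.0.0.1:8080":
        raise TimeoutError("simulated timeout")
    return f"fetched {url} via {proxy}"


proxy_pool = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # placeholder addresses
result = fetch_with_proxy_rotation("http://httpbin.org/get", proxy_pool, fake_send,
                                   retry_on=(TimeoutError,))
print(result)
```

For real traffic, one option is to pass send=send_with_requests together with retry_on=(requests.exceptions.Timeout, requests.exceptions.ProxyError), so that only timeouts and proxy failures trigger a switch.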

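II. The re library

1. Introduction

A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern. The re module gives Python full regular-expression support. In an automated scraper, re plays the role of information extractor: with it we can pull exactly the pieces we want out of page source in bulk.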

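2. A first test

The regular-expression tester provided by OSChina (https://tool.oschina.net/regex/#) is a good place to practice. Here we enter some test text and choose the preset for matching email addresses, whose regular expression is: [\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?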

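Does it look like a string of gibberish? In fact every piece of it has a specific meaning, as listed in the reference table below:

Pattern	Meaning
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit
\D	Matches any non-digit
\A	Matches at the start of the string
\Z	Matches at the end of the string (Perl-style engines also allow a trailing newline after it; Python's re does not)
\z	Matches at the very end of the string in Perl-style engines; Python 3.7 rejects it, so use \Z
\G	Matches where the previous match ended in Perl-style engines; not supported by Python's re
\n	Matches a newline
\t	Matches a tab
^	Matches at the start of the string
$	Matches at the end of the string (or just before a trailing newline)
.	Matches any character except a newline; with the re.DOTALL flag it matches newlines too
[...]	A set of characters: [amk] matches a, m, or k
[^...]	Any character not in the set: [^abc] matches any character other than a, b, or c
*	Matches zero or more repetitions of the preceding expression
+	Matches one or more repetitions of the preceding expression
?	Matches zero or one repetition of the preceding expression; placed after another quantifier (*?, +?), it makes that quantifier non-greedy
{n}	Matches exactly n repetitions of the preceding expression
{n,m}	Matches n to m repetitions of the preceding expression, greedily
a|b	Matches a or b
()	Matches the expression inside the parentheses and captures it as a group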
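As a quick check of the email pattern quoted above, the sketch below runs it through re.findall in Python. The sample text and the name EMAIL_RE are made up for illustration.

```python
import re

# The email pattern quoted above, written as a raw string
EMAIL_RE = r"[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?"

# Made-up sample text
text = "Contact alice@example.com or bob.smith@mail.example.org for details."

emails = re.findall(EMAIL_RE, text)
print(emails)
```

Every group in the pattern is non-capturing, written (?:...), so findall returns the full matched addresses rather than fragments.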

3. Main methods

The match function

Prototype: match(pattern, string, flags=0)

Tries to match the pattern at the beginning of the string; returns None if the beginning does not match.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the matched text
print(result.span())   # (start, end) indices of the match

The output is:

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)
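Because match() returns None when the start of the string does not fit the pattern, calling group() on the result without a check raises an AttributeError. A minimal sketch of the usual guard, reusing the same sample string:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# "123" occurs in the string, but not at position 0, so match() returns None
result = re.match(r'\d+', content)
print(result)

# Guard against None before calling group()
if result is None:
    print("no match at the start of the string")
else:
    print(result.group())
```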

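Catch-all matching

The regular expression in the code above is more complicated than it needs to be, because there is a catch-all pattern available: .* (dot star). The dot matches any character except a newline, and the star repeats the preceding element zero or more times, so together they match any run of characters. We can therefore simplify the code as follows:

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

The output is the same as before:

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)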

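Group matching

To extract a specific part of the string, wrap that part of the pattern in parentheses () to form a capture group.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+).*Demo$', content)
print(result.group())
print(result.group(1))

group() returns the whole match and group(1) returns the first captured group:

hello 123 4567 World_This is a regex Demo
123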
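A pattern can contain several groups. The sketch below, using the same sample string, shows how group(n) and groups() address them; the pattern itself is only an illustration.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+)\s(\d+)\s(\w+)', content)

print(result.group(0))   # whole match, same as group()
print(result.group(2))   # second group
print(result.groups())   # all groups as a tuple
```

Groups are numbered from 1 in the order of their opening parentheses, and group(0) is always the entire match.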

Greedy matching

Put simply, a greedy quantifier keeps matching as much as it can until it can match no more.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

The result is shown below: the group captures only 7. The greedy .* in front consumed as many characters as it could, leaving a single digit for \d+. This is greedy matching.

hello 123 4567 World_This is a regex Demo
7
{'name': '7'}

For non-greedy matching, add a question mark (?) after the quantifier:

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*?(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

This yields 123:

hello 123 4567 World_This is a regex Demo
123
{'name': '123'}
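The difference matters most when extracting repeated fragments from markup, which is the typical scraping case. A small side-by-side sketch on a made-up string:

```python
import re

html = "<b>first</b> and <b>second</b>"  # made-up sample markup

# Greedy: .* runs to the last </b> it can reach
greedy = re.findall(r'<b>(.*)</b>', html)
# Non-greedy: .*? stops at the first </b> after each <b>
lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)
print(lazy)
```

The greedy version returns one oversized match spanning both tags, while the non-greedy version returns the two fragments separately.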

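Adding matching modes to a function call

In match(pattern, string, flags=0), the third parameter, flags, sets the matching mode:

re.I: makes matching case-insensitive

re.L: locale-dependent matching (only valid with bytes patterns in Python 3)

re.S: makes . match every character, including newlines

re.M: multi-line mode; affects ^ and $

re.U: Unicode matching; affects \w, \W, \b, \B (already the default for str patterns in Python 3)

re.X: verbose mode; whitespace and comments are allowed inside the pattern to make it easier to read

For example, setting the mode to re.I makes matching case-insensitive:

import re

content = "heLLo 123 4567 World_This is a regex Demo"
result = re.match('hello', content, re.I)
print(result.group())

The result is shown below; heLLo is still matched:

heLLo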
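Two of the other flags, re.S and re.M, are easiest to understand on multi-line text, and several flags can be combined with the | operator. A short sketch on made-up sample text:

```python
import re

text = "Hello\nWORLD\nhello again"  # three lines of made-up sample text

# Without re.S, "." stops at the newline, so this finds nothing
print(re.search(r'Hello.*WORLD', text))
# With re.S, "." also matches "\n"
dotall = re.search(r'Hello.*WORLD', text, re.S)
print(repr(dotall.group()))

# re.M lets ^ match at the start of every line; combine flags with |
starts = re.findall(r'^hello', text, re.I | re.M)
print(starts)
```

Page source almost always spans many lines, so re.S is the flag scrapers reach for most often.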

The search function

Prototype: search(pattern, string, flags=0)

Scans the whole string and returns the first successful match.

import re

content = '''hahhaha hello 123 4567 world'''
result = re.search('hello.*world', content)
print(result.group())

Output:

hello 123 4567 world
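The contrast with match() is clearest when both are run on the same string. In this sketch the pattern occurs in the middle of the string, so only search() finds it:

```python
import re

content = "hahhaha hello 123 4567 world"

# match() only looks at the start of the string
print(re.match(r'hello', content))
# search() scans forward until the first occurrence
found = re.search(r'hello', content)
print(found.span())
```

When the position of the target in the page source is unknown, search() is usually the safer choice.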

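The findall function

Prototype: findall(pattern, string, flags=0). Searches the string and returns all matching substrings as a list; when the pattern contains a group, the list holds the group's contents.

import re

content = '''
<url>http://httpbin.org/get</url>
<url>http://httpbin.org/post</url>
<url>https://www.baidu.com</url>'''
# Each <url> element sits on its own line, so the greedy .* stops at the line end
urls = re.findall('<url>(.*)</url>', content)
for url in urls:
    print(url)

The code above prints every matching string, that is, every link:

http://httpbin.org/get
http://httpbin.org/post
https://www.baidu.com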
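When a pattern has more than one group, findall returns a list of tuples, and re.finditer yields match objects when positions are needed as well. A sketch on a shortened version of the same sample data:

```python
import re

content = '''
<url>http://httpbin.org/get</url>
<url>https://www.baidu.com</url>'''

# Two groups: findall returns a list of (scheme, rest-of-URL) tuples
pairs = re.findall(r'<url>(\w+)://(.*?)</url>', content)
print(pairs)

# finditer yields match objects, which also expose positions
for m in re.finditer(r'<url>(.*?)</url>', content):
    print(m.group(1), m.span(1))
```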

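The sub function

Prototype: sub(pattern, repl, string, count=0, flags=0). Replaces every match of the pattern in the string and returns the resulting string.

import re

content = '''hello 123 4567 world'''
result = re.sub('123.*world', 'future', content)
print(result)

The output shows that everything from 123 onward has been replaced with 'future':

hello future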
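Beyond the basic call, sub accepts a count limit and backreferences in the replacement, and the related function subn also reports how many replacements were made. A brief sketch on the same sample string:

```python
import re

content = "hello 123 4567 world"

# count=1 replaces only the first run of digits
first_only = re.sub(r'\d+', '#', content, count=1)
print(first_only)

# \1 in the replacement refers to the first captured group
wrapped = re.sub(r'(\d+)', r'[\1]', content)
print(wrapped)

# subn returns (new_string, number_of_replacements)
replaced, n = re.subn(r'\d+', '#', content)
print(replaced, n)
```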

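The compile function

Prototype: compile(pattern, flags=0). Compiles a regular expression into a pattern object so that it can be reused conveniently.

import re

content = '''hello 123 4567 world'''
pattern = '123.*world'
regex = re.compile(pattern)
result = re.sub(regex, 'future', content)
print(result)

The output is the same as above:

hello future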
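A compiled pattern object also has its own methods, so it can be applied to many strings without passing it back through the module-level functions. A minimal sketch with made-up sample lines:

```python
import re

# Compile once, then reuse the pattern object across many strings
number_re = re.compile(r'\d+')

lines = ["hello 123 4567 world", "no digits here", "room 42"]  # made-up sample data
for line in lines:
    print(number_re.findall(line))

# Pattern objects offer the same operations as the module-level functions
print(number_re.sub('#', lines[0]))
```

In a scraper that applies the same expression to many pages, compiling it once keeps the pattern in one place and makes the intent clearer.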

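That concludes this brief introduction to the requests and re libraries. The next article will use these two libraries in a hands-on scraping project. For the fundamentals, see the previous article:

Python Web Scraping in Practice (Part 1): Fundamentals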

Posted in: Python Web Scraping

2022-10-27
