爬虫——BeautifulSoup4解析器
BeautifulSoup用来解析HTML比较简单,API非常人性化,支持CSS选择器、Python标准库中的HTML解析器,也支持lxml的XML解析器。
其相较与正则而言,使用更加简单。
示例:
首先必须要导入bs4库
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") # 格式化输出 soup 对象的内容 print(soup.prettify())
运行结果
<html> <head> <title> The Dormouse's story </title> </head> <body> <p class="title" name="dromouse"> <b> The Dormouse's story </b> </p> <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --> </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a> ; and they lived at the bottom of a well. </p> <p class="story"> ... </p> </body> </html>
四大对象种类
BeautifulSoup将复杂的HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:
(1)Tag
(2)NavigableString
(3)BeautifulSoup
(4)Comment
1.Tag
Tag 通俗点讲就是HTML中的一个个标签,例如:
<head><title>The Dormouse's story</title></head> <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
上面title head a p 等等HTML标签加上里面包括的内容就是Tag,那么试着使用BeautifulSoup来获取Tags:
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") # # 打印title标签 print(soup.title) # 打印head标签 print(soup.head) # 打印a标签 print(soup.a) # 打印p标签 print(soup.p) # 打印soup.p的类型 print(type(soup.p))
运行结果
<title>The Dormouse's story</title> <head><title>The Dormouse's story</title></head> <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <class 'bs4.element.Tag'>
我们可以利用soup加标签名轻松地获取这些标签内容,这些对象的类型是bs4.element.Tag。但是注意,它查找的是在所有内容中的第一个符合要求的标签。如果需要查询所有的标签,后面会进行介绍。
对于Tag,它有两个重要的属性,就是name和attrs。
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") # soup对象比较特殊,它的name为[document] print(soup.name) # 对于其他内部标签,输出的值便为标签本身的名称 print(soup.head.name) # 打印p标签的所有属性,其类型是一个字典 print(soup.p.attrs) # 打印p标签的class属性 print(soup.p['class']) # 还可以利用get方法获取属性,传入属性的名称,与上面的方法等价 print(soup.p.get('class')) print(soup.p) # 修改属性 soup.p['class'] = "newClass" print(soup.p) # 删除属性 del soup.p['class'] print(soup.p)
运行结果
[document] head {'class': ['title'], 'name': 'dromouse'} ['title'] ['title'] <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p> <p name="dromouse"><b>The Dormouse's story</b></p>
2.NavigableString
既然我们已经得到了标签的内容,那么问题来了,我们想要获取标签内部的文字怎么办呢?很简单,用.string即可,例如:
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") # 打印p标签的内容 print(soup.p.string) # 打印soup.p.string的类型 print(type(soup.p.string))
运行结果
The Dormouse's story <class 'bs4.element.NavigableString'>
3.BeautifulSoup
BeautifulSoup对象表示的是一个文档的内容。大部分时候,可以把它当作Tag对象,是一个特殊的Tag,我们可以分别获取它的类型,名称,以及属性。
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") # 类型 print(type(soup.name)) # 名称 print(soup.name) # 属性 print(soup.attrs)
运行结果
<class 'str'> [document] {}
4.Comment
Comment对象是一个特殊类型的NavigableString对象,其输出的内容不包括注释符号。
#!/usr/bin/python3 # -*- coding:utf-8 -*- from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # 创建 Beautiful Soup 对象,指定lxml解析器 soup = BeautifulSoup(html, "lxml") print(soup.a) print(soup.a.string) print(type(soup.a.string))
运行结果
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> Elsie <class 'bs4.element.Comment'>
a标签里的内容实际上是注释,但是如果我们利用.string来输出它的内容时,注释符号已经去掉了。
神龙|纯净稳定代理IP免费测试>>>>>>>>天启|企业级代理IP免费测试>>>>>>>>IPIPGO|全球住宅代理IP免费测试