What is the HTMLParser module in Python?

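This article introduces HTMLParser, the HTML parsing class in Python's built-in html.parser module, which is used mostly for page scraping. Suppose we need to collect the static links that already exist on a set of pages. Each page contains a lot of markup and there are many pages, so doing this by hand is tedious and time-consuming. This is where HTMLParser helps; let's look at how to use it.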

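Installation:

No installation is needed. HTMLParser ships with the Python 3 standard library in the html.parser module. (The command "npm install htmlparser" installs an unrelated Node.js package with a similar name; its Parser(handler) constructor is JavaScript and is not part of the Python API.)

HTMLParser provides this constructor:

HTMLParser(*, convert_charrefs=True)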

Parsing HTML with HTMLParser:

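from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Print every parsing event as the parser reports it."""

    def handle_starttag(self, tag, attrs):
        # Opening tag such as <p>; attrs is a list of (name, value) pairs.
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        # Closing tag such as </p>.
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        # XHTML-style self-closing tag such as <br/>.
        print('<%s/>' % tag)

    def handle_data(self, data):
        # Text between tags, including whitespace and newlines.
        print(data)

    def handle_comment(self, data):
        # Comment body, without the <!-- and --> delimiters.
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        # Named reference such as &amp;. Only called when the parser is
        # created with convert_charrefs=False.
        print('&%s;' % name)

    def handle_charref(self, name):
        # Numeric reference such as &#62;. Only called when the parser is
        # created with convert_charrefs=False.
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href="#">html</a> HTML tutorial...<br>END</p>
</body></html>''')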
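
The sample page above contains no character references, so handle_entityref and handle_charref never run. The following is a minimal sketch of how they could be exercised; the class name EntityParser, the events list and the sample string are illustrative assumptions, not part of the original example. It passes convert_charrefs=False to the constructor, because with the default (True) the parser decodes references itself and passes the result to handle_data.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class EntityParser(HTMLParser):
    """Hypothetical helper: records text and decoded character references."""

    def __init__(self):
        # convert_charrefs=False makes the parser report references
        # through handle_entityref / handle_charref instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        # name is the part between '&' and ';', for example 'amp'.
        codepoint = name2codepoint.get(name)
        text = chr(codepoint) if codepoint is not None else '&%s;' % name
        self.events.append(('entity', text))

    def handle_charref(self, name):
        # name is the part after '&#', for example '62' or 'x3E'.
        code = int(name[1:], 16) if name[0] in 'xX' else int(name)
        self.events.append(('charref', chr(code)))


parser = EntityParser()
parser.feed('<p>a &amp; b &#62; c</p>')
parser.close()

# Joining the recorded pieces gives the decoded text of the paragraph.
decoded = ''.join(text for _, text in parser.events)
print(decoded)
```

Using name2codepoint from html.entities keeps the lookup table in the standard library; unknown names are left as written rather than raising an error.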

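HTML looks similar to XML but is not a subset of it, and its syntax is far less strict: a tag such as <br> does not have to be closed, for example. HTMLParser tolerates this kind of markup, so you can try it on real-world pages.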
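
As an illustration of the link-collection task mentioned at the start, here is a minimal sketch; the class name LinkCollector, the links attribute and the sample markup are assumptions made for this example. The sample deliberately leaves some li and a tags unclosed to show that HTMLParser still reports the start tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Loosely written HTML: the first <li> and <a> are never closed.
sample = '''<ul>
<li><a href="/docs">Docs
<li><a href="https://example.com/faq">FAQ</a>
<li><a name="top">no href here</a>
</ul>'''

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

In a real crawler the string passed to feed would be the downloaded page text; fetching pages and resolving relative links are outside the scope of this sketch.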

Posted in: Python Web Scraping

2021-07-18

# Module usage
