Scrapy Crawler Essentials: urljoin() from urlparse


First, import the module and use help() to view its documentation:

    >>> from urllib import parse
    >>> help(parse.urljoin)
    Help on function urljoin in module urllib.parse:

    urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.

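In other words, it combines a base URL with a relative URL to produce an absolute URL. That explanation is rather abstract, though, so a few examples help: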

    >>> from urllib.parse import urljoin
    >>> urljoin("http://www.google.com/1/aaa.html", "bbbb.html")
    'http://www.google.com/1/bbbb.html'
    >>> urljoin("http://www.google.com/1/aaa.html", "2/bbbb.html")
    'http://www.google.com/1/2/bbbb.html'
    >>> urljoin("http://www.google.com/1/aaa.html", "/2/bbbb.html")
    'http://www.google.com/2/bbbb.html'
    >>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/3/ccc.html")
    'http://www.google.com/3/ccc.html'
    >>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/ccc.html")
    'http://www.google.com/ccc.html'
    >>> urljoin("http://www.google.com/1/aaa.html", "javascript:void(0)")
    'javascript:void(0)'
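The same pattern can be checked with a short script. This is only a minimal sketch: the `../` link and the two base URLs without a file name are extra cases that are not in the session above, added on the assumption that a crawler will run into them sooner or later.

```python
from urllib.parse import urljoin

BASE = "http://www.google.com/1/aaa.html"

# Links as they might appear in href attributes on the page at BASE.
# "../bbbb.html" is an extra case, not from the original session.
LINKS = [
    "bbbb.html",                         # sibling of aaa.html
    "2/bbbb.html",                       # below the current directory
    "/2/bbbb.html",                      # leading slash: resolved from the site root
    "../bbbb.html",                      # one directory up
    "http://www.google.com/3/ccc.html",  # already absolute: returned unchanged
    "javascript:void(0)",                # different scheme: returned unchanged
]

# Map every link to the absolute URL that urljoin produces for it.
joined = {link: urljoin(BASE, link) for link in LINKS}

for link, absolute in joined.items():
    print(f"{link!r:40} -> {absolute!r}")

# A trailing slash on the base matters: without it, the last path
# segment is treated as a file name and replaced.
with_slash = urljoin("http://www.google.com/1/", "bbbb.html")
without_slash = urljoin("http://www.google.com/1", "bbbb.html")
print(with_slash)
print(without_slash)
```

The rule behind all of these results is that the last path segment of the base is dropped before the relative link is appended, unless the link starts with `/` or carries its own scheme.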

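The pattern is easy to spot, but that does not mean the job is done. Special cases still need handling, such as a link that points back to the page itself or a link that contains invalid characters.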

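    # This fragment runs inside the loop over the links found on a page.
    url = urljoin("****", "****")
    # str.find() returns the index of the first match, or -1 if there is none.
    # Skip links that contain a single quote.
    if url.find("'") != -1:
        continue
    # Keep only the part before the '#'.
    url = url.split('#')[0]
    # isindexed() is a function I defined myself; it is used here to check
    # that the link is not already in the database of saved links.
    if url[0:4] == 'http' and not self.isindexed(url):
        # newpages = set(), an unordered collection of unique elements
        newpages.add(url)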
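The fragment above depends on a surrounding loop and on the author's own `isindexed()` method, so it cannot be run as it stands. The following is a self-contained sketch of the same filtering idea, not the author's implementation. The function name `collect_links`, its parameters, and the `indexed` set that stands in for the database lookup are all hypothetical. It also swaps in `urllib.parse.urldefrag` for the manual `split('#')`, adds an explicit check for links that point back to the page itself, and accepts only `http://` and `https://` prefixes, which is slightly stricter than the original `url[0:4] == 'http'` test.

```python
from urllib.parse import urljoin, urldefrag


def collect_links(page_url, hrefs, indexed):
    """Return the set of new absolute http(s) links found on page_url.

    `indexed` is a hypothetical set of already-saved URLs; it stands in
    for the original isindexed() database lookup.
    """
    newpages = set()
    for href in hrefs:
        url = urljoin(page_url, href)
        # Skip links that contain a single quote, as in the original fragment.
        if "'" in url:
            continue
        # Drop the fragment: keep only the part before '#'.
        url, _fragment = urldefrag(url)
        # Assumption: a link that resolves to the page itself is not new.
        if url == page_url:
            continue
        if url.startswith(("http://", "https://")) and url not in indexed:
            newpages.add(url)
    return newpages


page = "http://www.google.com/1/aaa.html"
hrefs = [
    "bbbb.html",
    "bbbb.html#top",          # same target once the fragment is removed
    "aaa.html#section",       # points back to the page itself
    "javascript:void(0)",     # not an http(s) link
    "/2/it's.html",           # contains a single quote
    "http://www.google.com/3/ccc.html",
]
already_indexed = {"http://www.google.com/3/ccc.html"}

found = collect_links(page, hrefs, already_indexed)
print(found)
```

Collecting into a set means that `bbbb.html` and `bbbb.html#top` are stored only once, which is the reason the original code uses `set()` for `newpages`.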

Original article: https://www.cnblogs.com/phil-chow/p/5347947.html


Posted in: Python Crawlers

2022-10-25
