Many links on a website are written as relative paths, so requesting them directly while crawling raises an error:
Missing scheme in request url: ../index.html
In Python 3, use urljoin from urllib.parse:
>>> from urllib.parse import urljoin
>>> urljoin("http://www.asite.com/folder/currentpage.html", "anotherpage.html")
'http://www.asite.com/folder/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "folder2/anotherpage.html")
'http://www.asite.com/folder/folder2/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "/folder3/anotherpage.html")
'http://www.asite.com/folder3/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "../finalpage.html")
'http://www.asite.com/finalpage.html'
urljoin automatically combines the URL of the current page with the relative path to produce an absolute URL.
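In a crawler, the same call is usually applied to every link extracted from a page before it is requested. The following is a minimal sketch of that step, not a definitive implementation. The helper name resolve_links and the sample href list are assumptions made for illustration, and the filter that keeps only http/https links is an extra design choice that the original text does not cover.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Resolve each href found on page_url into an absolute URL.

    resolve_links is a hypothetical helper written for this example.
    """
    absolute = []
    for href in hrefs:
        # urljoin resolves relative paths against the page URL and
        # returns hrefs that are already absolute unchanged.
        full = urljoin(page_url, href)
        # Assumption: only http(s) links are worth crawling, so
        # schemes such as mailto: are skipped.
        if urlparse(full).scheme in ("http", "https"):
            absolute.append(full)
    return absolute


links = resolve_links(
    "http://www.asite.com/folder/currentpage.html",
    ["../index.html", "anotherpage.html", "mailto:someone@asite.com"],
)
print(links)
```

If the error above comes from a Scrapy spider, response.urljoin(href) performs the same resolution against the URL of the response.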