Preface
In the previous article we hopped randomly between article pages within Wikipedia and ignored links to external sites. This article deals with a site's external links and tries to collect some data from other sites. Unlike crawling a single domain, sites on different domains have wildly different structures, which means our code needs to be more flexible in order to adapt to them.
So we will write the code as a set of functions that can be combined to cover different kinds of crawling needs.
Randomly hopping to external links
With this set of functions, we can crawl external sites in roughly 50 lines of code.
Sample code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
from urllib.parse import quote
pages = set()
random.seed(datetime.datetime.now().timestamp())  # seed with the current time (newer Python versions require an int/float/str seed)
''' Get all of the links found on a page '''

# Retrieves a list of all internal links found on a page
def get_internal_links(soup, include_url):
    internal_links = []
    # Finds all links that begin with a '/' or contain the current URL
    print(include_url)
    for link in soup.find_all('a',
                              href=re.compile(r'^((/|.)*' + include_url + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internal_links:
                internal_links.append(link.attrs['href'])
    return internal_links
# Retrieves a list of all external links found on a page
def get_external_links(soup, exclude_url):
    external_links = []
    # Finds all links that start with 'http' or 'www' and do not contain the current URL
    for link in soup.find_all('a',
                              href=re.compile(r'^(http|www)((?!' + exclude_url + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in external_links:
                external_links.append(link.attrs['href'])
    return external_links
# Splits an address to get the main domain
def split_address(address):
    address_parts = address.replace('http://', '').split('/')
    return address_parts
# Hops to a random external link
def get_random_external_link(starting_page):
    html = urlopen(starting_page)
    soup = BeautifulSoup(html, 'lxml')
    external_links = get_external_links(
        soup, split_address(starting_page)[0])  # exclude the current domain
    if len(external_links) == 0:
        # No external links on this page: follow a random internal link and try again
        internal_links = get_internal_links(soup, starting_page)
        print(len(internal_links))
        return get_random_external_link(
            internal_links[random.randint(0, len(internal_links) - 1)])
    else:
        return external_links[random.randint(0, len(external_links) - 1)]
hop_count = set()

# Only follow external links; loop sets the number of hops (default 5)
def follow_external_only(starting_site, loop=5):
    global hop_count
    external_link = get_random_external_link(
        quote(starting_site, safe='/:?='))
    print('Random external link is: ' + external_link)
    while len(hop_count) < loop:
        hop_count.add(external_link)
        print(len(hop_count))
        follow_external_only(external_link)

follow_external_only("http://www.baidu.com")
Because the code has no exception handling and no counter-measures for anti-scraping defenses, it is bound to throw an error at some point. Since the hops are random, you can run it a few times; if you are interested, you can improve the code based on the cause of each error.
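As a starting point, here is a minimal sketch of how the network call might be guarded (the helper name get_soup, the timeout value, and the skip-on-failure behavior are my own additions, not part of the original code):

from urllib.error import HTTPError, URLError

# Hypothetical helper: returns a BeautifulSoup object, or None if the request fails
def get_soup(url):
    try:
        html = urlopen(url, timeout=10)
    except (HTTPError, URLError, ValueError) as e:
        # HTTP errors, unreachable hosts, and malformed URLs all land here
        print('Failed to fetch {}: {}'.format(url, e))
        return None
    return BeautifulSoup(html, 'lxml')

get_random_external_link and get_all_external_links could then call get_soup instead of urlopen and simply skip any URL for which it returns None.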
Output:
Random external link is: http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
1
Random external link is: http://baishi.baidu.com/watch/6388818335201070269.html
2
Random external link is: http://v.baidu.com/tv/
3
Random external link is: http://player.baidu.com/yingyin.html
4
Random external link is: http://help.baidu.com/question?prod_en=player
5
Random external link is: http://home.baidu.com
[Finished in 6.3s]
Collecting all external links on a site
The benefit of writing the code as functions is that it can easily be modified or extended for new requirements without breaking anything. For example:
Goal: crawl all external links across an entire site and record each one.
We can add the following function:
# Collects a list of all external URLs found on the site
all_ext_links = set()
all_int_links = set()

def get_all_external_links(site_url):
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    print(split_address(site_url)[0])
    internal_links = get_internal_links(soup, split_address(site_url)[0])
    external_links = get_external_links(soup, split_address(site_url)[0])
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link)

# follow_external_only("http://www.baidu.com")
get_all_external_links('http://oreilly.com')
The output looks like this:
oreilly.com
oreilly.com
https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf
http://twitter.com/oreillymedia
http://fb.co/OReilly
https://www.linkedin.com/company/oreilly-media
https://www.youtube.com/user/OreillyMedia
About to get link: https://www.oreilly.com
https:
https:
https://www.oreilly.com
http://www.oreilly.com/ideas
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
http://www.oreilly.com/conferences/
http://shop.oreilly.com/
http://members.oreilly.com
https://www.oreilly.com/topics
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now
https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in
https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course
https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access
http://www.oreilly.com/live-training/?view=grid
https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform
https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends
https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles
http://www.oreilly.com/about/
http://www.oreilly.com/work-with-us.html
http://www.oreilly.com/careers/
http://shop.oreilly.com/category/customer-service.do
http://www.oreilly.com/about/contact.html
http://www.oreilly.com/emails/newsletters/
http://www.oreilly.com/terms/
http://www.oreilly.com/privacy.html
http://www.oreilly.com/about/editorial_independence.html
About to get link: https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
https:
https:
https://www.oreilly.com/
About to get link: https://www.oreilly.com/
https:
https:
About to get link: https://www.oreilly.com/topics
……
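Note the bare "https:" entries in the output above: split_address only strips 'http://', so for an https URL the first element is 'https:' rather than the host name. A more robust sketch using urllib.parse.urlparse (this variant is my own suggestion, not from the original code, and it assumes the address includes a scheme):

from urllib.parse import urlparse

# Returns [host, path segment, ...] so that index 0 is always the bare domain,
# whether the URL uses http or https
def split_address(address):
    parsed = urlparse(address)
    return [parsed.netloc] + [part for part in parsed.path.split('/') if part]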
The program will keep recursing until it hits Python's default recursion limit; if you are interested, you can add a hop limit like the loop=5 parameter in the earlier code.
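A minimal sketch of one way to add such a limit, passing a depth counter through the recursion (the depth and max_depth parameters are my own additions):

def get_all_external_links(site_url, depth=0, max_depth=5):
    # Stop recursing once we are max_depth levels deep
    if depth >= max_depth:
        return
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    internal_links = get_internal_links(soup, split_address(site_url)[0])
    external_links = get_external_links(soup, split_address(site_url)[0])
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link, depth + 1, max_depth)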