Crawling External Websites with Python


Preface

In the previous article we hopped randomly between article pages inside Wikipedia while ignoring links to external sites. In this article we handle a site's external links and try to collect some data from other sites. Unlike crawling a single domain, sites on different domains have wildly different structures, which means our code has to be far more flexible to adapt to them.

Therefore, we will write the code as a group of functions that can be combined to serve different kinds of crawling needs.

Randomly Hopping to External Links

Using this group of functions, we can crawl external sites in roughly 50 lines of code.

Example code:

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now().timestamp())

# Gets all links found on a page

# Retrieves a list of all internal links found on a page
def get_internal_links(soup, include_url):
    internal_links = []
    print(include_url)
    # Finds all links that begin with a '/' or contain the current URL
    for link in soup.find_all('a',
                              href=re.compile(r'^((/|.)*' + include_url + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internal_links:
                internal_links.append(link.attrs['href'])
    return internal_links

# Retrieves a list of all external links found on a page
def get_external_links(soup, exclude_url):
    external_links = []
    # Finds all links that start with 'http' or 'www' and do not contain the
    # current URL
    for link in soup.find_all('a',
                              href=re.compile(r'^(http|www)((?!' + exclude_url + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in external_links:
                external_links.append(link.attrs['href'])
    return external_links

# Splits an address to extract the main domain
def split_address(address):
    address_parts = address.replace('http://', '').split('/')
    return address_parts

# Picks a random external link to hop to
def get_random_external_link(starting_page):
    html = urlopen(starting_page)
    soup = BeautifulSoup(html, 'lxml')
    external_links = get_external_links(
        soup, split_address(starting_page)[0])  # find the domain URL
    if len(external_links) == 0:
        internal_links = get_internal_links(soup, starting_page)
        print(len(internal_links))
        return get_external_links(
            soup, internal_links[random.randint(0, len(internal_links) - 1)])
    else:
        return external_links[random.randint(0, len(external_links) - 1)]

hop_count = set()

# Follows external links only; loop sets the number of hops (default 5)
def follow_external_only(starting_site, loop=5):
    global hop_count
    external_link = get_random_external_link(
        quote(starting_site, safe='/:?='))
    print('Random external link is: ' + external_link)
    while len(hop_count) < loop:
        hop_count.add(external_link)
        print(len(hop_count))
        follow_external_only(external_link)

follow_external_only("http://www.baidu.com")

Because the code has no exception handling and no counter-measures against anti-scraping, it is bound to raise errors eventually. Since the hops are random, you can run it several times; if you are interested, refine the code based on whatever error each run produces.
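As a minimal sketch of that kind of hardening (the fetch_soup helper, its retries parameter, and the browser-like User-Agent header below are assumptions added for illustration, not part of the original code), the network call could be wrapped like this:

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def fetch_soup(url, retries=2):
    # Fetch a page and return a BeautifulSoup object, or None on failure
    # A browser-like User-Agent makes some sites less likely to reject the request
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(retries + 1):
        try:
            html = urlopen(request, timeout=10)
            return BeautifulSoup(html, 'lxml')
        except (HTTPError, URLError, ValueError) as err:
            print('Failed to fetch {} ({}), attempt {}'.format(url, err, attempt + 1))
    return None

get_random_external_link could then call fetch_soup(starting_page) and simply skip the hop whenever it returns None, instead of crashing on a bad link.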

Output:

Random external link is: http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=

1

Random external link is: http://baishi.baidu.com/watch/6388818335201070269.html

2

Random external link is: http://v.baidu.com/tv/

3

Random external link is: http://player.baidu.com/yingyin.html

4

Random external link is: http://help.baidu.com/question?prod_en=player

5

Random external link is: http://home.baidu.com

[Finished in 6.3s]

Collecting All External Links on a Site

The advantage of organizing the code into functions is that it can easily be modified or extended to meet new requirements without breaking what already works. For example:

Goal: crawl all external links across the entire site and record each one.

We can add the following function:

# Collects a list of all external URLs found on the site
all_ext_links = set()
all_int_links = set()

def get_all_external_links(site_url):
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    print(split_address(site_url)[0])
    internal_links = get_internal_links(soup, split_address(site_url)[0])
    external_links = get_external_links(soup, split_address(site_url)[0])
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link)

# follow_external_only("http://www.baidu.com")
get_all_external_links('http://oreilly.com')

The output looks like this:

oreilly.com

oreilly.com

https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf

http://twitter.com/oreillymedia

http://fb.co/OReilly

https://www.linkedin.com/company/oreilly-media

https://www.youtube.com/user/OreillyMedia

About to get link: https://www.oreilly.com

https:

https:

https://www.oreilly.com

http://www.oreilly.com/ideas

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav

http://www.oreilly.com/conferences/

http://shop.oreilly.com/

http://members.oreilly.com

https://www.oreilly.com/topics

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now

https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in

https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course

https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access

http://www.oreilly.com/live-training/?view=grid

https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform

https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends

https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles

http://www.oreilly.com/about/

http://www.oreilly.com/work-with-us.html

http://www.oreilly.com/careers/

http://shop.oreilly.com/category/customer-service.do

http://www.oreilly.com/about/contact.html

http://www.oreilly.com/emails/newsletters/

http://www.oreilly.com/terms/

http://www.oreilly.com/privacy.html

http://www.oreilly.com/about/editorial_independence.html

About to get link: https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav

https:

https:

https://www.oreilly.com/

About to get link: https://www.oreilly.com/

https:

https:

About to get link: https://www.oreilly.com/topics

……

The program will keep recursing until it hits Python's default recursion limit. If you are interested, add a depth limit like the loop=5 parameter in the earlier code.
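A minimal sketch of such a limit (the depth and max_depth parameters below are assumptions added for illustration) might look like this:

def get_all_external_links(site_url, depth=0, max_depth=5):
    # Stop recursing once the maximum crawl depth is reached
    if depth >= max_depth:
        return
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    domain = split_address(site_url)[0]
    for link in get_external_links(soup, domain):
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in get_internal_links(soup, domain):
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link, depth + 1, max_depth)

With this change, each recursive call carries its own depth counter, so the crawl stops after max_depth levels instead of running into a RecursionError.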

