Python Web Scraping in Practice: Crawling Stock Information with the Scrapy Framework, Step by Step


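Preface
Steps
- Step 1
  - Analyzing the stock code page
  - Creating a Scrapy project
  - Creating a Scrapy spider in the project
  - Getting the stock codes
- Step 2
  - Analyzing the stock information page
  - Getting the stock name
  - Getting the key stock information
  - Configuring the stock spider
    - Editing stock.py
    - Editing pipelines.py
    - Editing settings.py
- Step 3
  - Running the spider
Full code
- stock.py
- pipelines.py
- settings.py (non-comment lines only)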

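The information collected in this article is for learning purposes only and will not be used commercially (stating this up front, just to be safe).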

Advantages of Scrapy:

Good concurrency performance
Structured project layout

Versions used in this article:

Python 3.7.3
Scrapy 2.0.1
PyCharm 2019.3 Community
re (Python standard library)
BeautifulSoup (bs4)

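Step 1: get the stock codes (no codes, no way to crawl the information ◑﹏◐)
PS: you could also crawl the stock rankings instead (￣_,￣ )

Step 2: use the stock codes to fetch each stock's information

Step 3: run the spider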

First, find a website that lists stocks. I chose the stock code lookup table on Eastmoney (东方财富网), which covers both the Shanghai and Shenzhen exchanges.
URL: http://quote.eastmoney.com/stock_list.html
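In Chrome, press Ctrl+U to view the page source.

The stock codes follow a clear pattern (sh/szXXXXXX): a prefix of sh or sz followed by six digits. The link for each stock also contains its code.
PS: the number in parentheses after each stock name lacks the sh/sz prefix, which makes it inconvenient for fetching information later.

Several extraction methods would work here; this article uses CSS selectors.

Looking at a single entry, the stock code sits inside an <a> tag, and that tag's href attribute holds a URL containing the code. Selecting the <a> tags with a CSS selector therefore yields those URLs.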
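To make the idea concrete without a running crawl, here is a minimal sketch that collects the same values the CSS query a::attr(href) would select, using only the standard library's html.parser instead of Scrapy. The class name HrefCollector and the sample markup are assumptions made for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (what a::attr(href) selects)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical markup, shaped like entries of a stock list page.
SAMPLE_HTML = (
    '<ul>'
    '<li><a target="_blank" href="http://quote.eastmoney.com/sh600000.html">Example A(600000)</a></li>'
    '<li><a target="_blank" href="http://quote.eastmoney.com/sz300783.html">Example B(300783)</a></li>'
    '<li><a name="top">anchor without href</a></li>'
    '</ul>'
)

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
print(collector.hrefs)
```

Inside a spider the same result comes from response.css('a::attr(href)'), so this helper is only useful for experimenting with markup offline.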

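Now we can start using Scrapy.

Scrapy is driven from the command line, so the project has to be created in a terminal (cmd on Windows).

PyCharm ships with a built-in terminal (usually a Terminal tab at the bottom of the window, which opens in the PyCharm project directory by default), so you can open it and type the commands there.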
Enter the command:

scrapy startproject demo

(demo is the project name; pick any name you like (●ˇ∀ˇ●))

This creates a Scrapy project named demo, as a project folder inside the current directory.

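Before creating the spider, you need to move into the project, that is, into the project folder demo.

Note: the spider must be created from inside the project, and the spider's name must not be the same as the project name.

Use cd on the command line to enter the project folder:

cd demo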

Create the spider with Scrapy's genspider command:

scrapy genspider stock quote.eastmoney.com

stock is the spider name
quote.eastmoney.com is the domain
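A spider file stock.py now appears under demo/spiders. Open it:

It contains a starter template, in which:
allowed_domains limits the domains the spider is allowed to crawl. The stock details will later be requested from a different site (xueqiu.com), and requests outside the allowed domains would be filtered out, so this line is commented out in this example.

start_urls lists the URLs the crawl starts from. Change it to our target page by replacing the URL inside the string with http://quote.eastmoney.com/stock_list.html

def parse(self, response) is the function Scrapy calls to process what comes back after it requests start_urls. The response object is the returned response; the processing of that response goes inside parse.

Code:

    import scrapy


    class StockSpider(scrapy.Spider):
        name = 'stock'
        # allowed_domains = ['quote.eastmoney.com']
        start_urls = ['http://quote.eastmoney.com/stock_list.html']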

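We already know where the stock codes live in the page; now let's extract them.

After Scrapy fetches each URL in start_urls, it passes the response to the parse function (which we write ourselves) for processing.

Use the css method of Scrapy's response object to get the links that contain the stock codes:

    for href in response.css('a::attr(href)').extract():
        try:
            pass  # process href here
        except AttributeError:
            continue

extract() returns a list of strings, so the for loop walks through the href attribute (the URL) of every <a> tag.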

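Next, use a regular expression (re) to pull the stock code out of each URL. When a link contains no stock code, re.search returns None and calling .group(0) on it raises an exception, which is why the lookup sits inside try/except:

    stock = re.search(r"[s][hz]\d{6}", href).group(0)

The site used later for the stock information expects an upper-case SH/SZ prefix, so convert the code with str.upper():

    stock = stock.upper()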
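The two steps above can be tried in isolation. The following is a small sketch, assuming a helper named extract_stock_code (a name invented here) and example URLs in the sh/szXXXXXX form described earlier; it returns None explicitly instead of relying on an exception.

```python
import re

# Same pattern as in the text: "sh" or "sz" followed by six digits.
STOCK_CODE = re.compile(r"[s][hz]\d{6}")


def extract_stock_code(href):
    """Return the upper-cased stock code found in href, or None if there is none."""
    match = STOCK_CODE.search(href)
    if match is None:
        return None
    return match.group(0).upper()


print(extract_stock_code("http://quote.eastmoney.com/sz300783.html"))  # SZ300783
print(extract_stock_code("http://quote.eastmoney.com/center/"))        # None
```

Checking for None up front is one alternative to try/except; either approach skips links that are not stock links.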

Code:

    import re

    import scrapy


    class StockSpider(scrapy.Spider):
        name = 'stock'
        # allowed_domains = ['quote.eastmoney.com']
        start_urls = ['http://quote.eastmoney.com/stock_list.html']

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                try:
                    stock = re.search(r"[s][hz]\d{6}", href).group(0)
                    stock = stock.upper()
                    # Only the stock code is extracted so far;
                    # fetching the stock information comes later.
                except AttributeError:
                    # re.search returned None: this link has no stock code
                    continue

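Next we need to do something with the stock codes: look up each stock's information on a stock website.
Eastmoney's stock detail pages are generated by JavaScript, so the data cannot be scraped straight from the page source. We can, however, get it from another site.

For the stock information we use Xueqiu (雪球网) (☆▽☆)
URL: https://xueqiu.com/S/SZ300783 (三只松鼠)

This site only responds normally when the request headers carry a browser User-Agent.
In this example that is done by setting USER_AGENT in settings.py (shown later).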

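Looking at the page URLs, each one is /S/ followed by the stock code.
So the URL we use is:
URL: https://xueqiu.com/S/ + stock code

With the stock codes collected earlier we can crawl the stock information from this site.
PS: the SZ/SH prefix here is upper case, whereas the codes taken from Eastmoney are lower case, so they need converting with str.upper().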
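As a rough sketch of this URL-building step outside Scrapy, the function below turns a list of hrefs into Xueqiu URLs. The name build_stock_urls and the explicit de-duplication are assumptions added for illustration; Scrapy's scheduler already filters duplicate requests by default, so the spider itself does not need the seen set.

```python
import re

XUEQIU_PREFIX = "https://xueqiu.com/S/"


def build_stock_urls(hrefs):
    """Map list-page hrefs to Xueqiu URLs, skipping non-stock links and repeats."""
    seen = set()
    urls = []
    for href in hrefs:
        match = re.search(r"[s][hz]\d{6}", href)
        if match is None:
            continue  # not a stock link
        code = match.group(0).upper()  # Xueqiu expects SH/SZ in upper case
        if code not in seen:
            seen.add(code)
            urls.append(XUEQIU_PREFIX + code)
    return urls


sample_hrefs = [
    "http://quote.eastmoney.com/sz300783.html",
    "http://quote.eastmoney.com/center/",
    "http://quote.eastmoney.com/sz300783.html",
]
print(build_stock_urls(sample_hrefs))
```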

Here I add a new function, parse_stock(self, response), to extract the stock information.

Code:

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.search(r"[s][hz]\d{6}", href).group(0)
                stock = stock.upper()
                url = 'https://xueqiu.com/S/' + stock
                # parse_stock will extract the stock information
                yield scrapy.Request(url, callback=self.parse_stock)
            except AttributeError:
                # re.search returned None: this link has no stock code
                continue

    def parse_stock(self, response):
        # filled in below
        pass

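What yield does: it turns parse into a generator. Each time execution reaches the yield statement, the function hands the scrapy.Request back to Scrapy and pauses; when Scrapy asks for the next value, the function resumes where it stopped and moves on to the next link. Requests are therefore produced one at a time rather than all at once.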
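The pause-and-resume behaviour is plain Python and can be observed without Scrapy. In this small sketch the generator make_urls and the log list are invented for the demonstration; next() plays the role of Scrapy asking for the next request.

```python
log = []


def make_urls(codes):
    for code in codes:
        log.append("building " + code)
        yield "https://xueqiu.com/S/" + code   # pause here and hand the value back
        log.append("resumed after " + code)


gen = make_urls(["SZ300783", "SH600000"])
first = next(gen)    # runs up to the first yield, then pauses
second = next(gen)   # resumes after the first yield and runs to the second

print(first)
print(second)
print(log)
```

After the two next() calls the log holds "building SZ300783", "resumed after SZ300783", "building SH600000": the code after the first yield only ran once the second value was requested.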

Press Ctrl+U to view the page source (Ctrl+F searches within it).
The key information is in the following snippet (around line 49 of the source):

}</script><script>window.STOCK_PAGE = true; SNB = { data: { quote: {"symbol":"SZ300783","high52w":81.5,"delayed":0,"type":11,"tick_size":0.01,"float_shares":41000000,"limit_down":60.72,"high":70,"float_market_capital":2824080000,"lot_size":100,"lock_set":null,"chg":"1.41","eps":0.83,"last_close":67.47,"profit_four":331710524.29,"volume":2540572,"volume_ratio":1.85,"profit_forecast":394224872,"turnover_rate":6.2,"low52w":17.62,"name":"三只松鼠","exchange":"SZ","pe_forecast":70.064,"total_shares":401000000,"status":1,"code":"300783","goodwill_in_net_assets":null,"avg_price":68.46,"percent":2.09,"amplitude":5.63,"current":"68.88","current_year_percent":7.01,"issue_date":1562860800000,"sub_type":"3","low":66.2,"market_capital":27620880000,"dividend":0,"dividend_yield":0,"currency":"CNY","navps":4.87,"profit":303859841.01,"timestamp":1585206243000,"pe_lyr":90.9,"amount":173917589.6,"pledge_ratio":null,"pb":14.144,"limit_up":74.22,"pe_ttm":83.268,"time":1585206243000,"open":66.2,"pankou_ratio":-16.19,"hasexist":false,"quoteMarket":{"status_id":7,"region":"CN","status":"已收盘","time_zone":"Asia/Shanghai","time_zone_desc":null,"statusStr":"已收盘"},"quoteRelation":[],"changeStr":"+1.41","percentStr":"+2.09%","stockColor":"stock-rise","parsedTime":"03-26 15:04:03（北京时间）","moneySymbol":"¥","afterHoursTime":"-","hasPankou":true,"flagStr":"","tableHtml":"<table class=\"quote-info\"><tr><td>最高：70.00</td><td>今开：66.20</td><td>涨停：74.22</td><td>成交量：25405手</td></tr><tr class=\"separateTop\"><td>最低：66.20</td><td>昨收：67.47</td><td>跌停：60.72</td><td>成交额：1.74亿</td></tr><tr 
class=\"separateBottom\"><td>量比：1.85</td><td>换手：6.20%</td><td>市盈率(动)：70.06</td><td>市盈率(TTM)：83.27</td></tr><tr><td>委比：-16.19%</td><td>振幅：5.63%</td><td>市盈率(静)：90.90</td><td>市净率：14.14</td></tr><tr><td>每股收益：0.83</td><td>股息(TTM)：0.00</td><td>总股本：4.01亿</td><td>总市值：276.21亿</td></tr><tr><td>每股净资产：4.87</td><td>股息率(TTM)：0.00%</td><td>流通股：4100.00万</td><td>流通值：28.24亿</td></tr><tr><td>52周最高：81.50</td><td>52周最低：17.62</td><td>货币单位：CNY</td></tr></table>","isMF":false,"isFundChart":false,"isDanjuan":false,"isNormalStock":true,"isFund":false,"isUSStock":false}, quoteTags: [] },

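With the key information located, we can start extracting it.

Start by going through the page source.
PS: use Ctrl+F to search for keywords instead of reading line by line, which is hard on the eyes (ノへ￣、); alternatively, right-click and choose Inspect.

The stock name appears in this form:
<div class="stock-name">三只松鼠(SZ:300783)</div>
It can be parsed with CSS selectors, re, or the BeautifulSoup library.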

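Here I use a regular expression (re):

    name = re.search(r'<div class="stock-name">(.*?)</div>', response.text).group(1)

In this line, response is the argument passed to the callback, and its text attribute holds the returned page as a unicode string.

Create a dictionary infoDict to store the stock information:

    infoDict = {}
    infoDict.update({'股票名称': name})  # the key means "stock name"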
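The name lookup can be tested against the sample element shown above. This is a sketch under assumptions: the helper names parse_stock_name and split_name_and_code are invented here, and splitting the title into a display name and a code is an optional extra that the article itself does not do.

```python
import re

NAME_PATTERN = re.compile(r'<div class="stock-name">(.*?)</div>')
# Title form taken from the sample element: name(SZ:300783)
TITLE_PATTERN = re.compile(r"(.+)\((S[HZ]):(\d{6})\)")


def parse_stock_name(page_text):
    """Return the text inside the stock-name div, or None if it is absent."""
    match = NAME_PATTERN.search(page_text)
    return match.group(1) if match else None


def split_name_and_code(title):
    """Split 'name(SZ:300783)' into ('name', 'SZ300783'); None if the form differs."""
    match = TITLE_PATTERN.fullmatch(title)
    if match is None:
        return None
    return match.group(1), match.group(2) + match.group(3)


SAMPLE = '<div class="stock-name">三只松鼠(SZ:300783)</div>'
name = parse_stock_name(SAMPLE)
print(name)
print(split_name_and_code(name))
```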

Next comes the stock's detailed information. The page source contains this entry:

"tableHtml":"<table class=\"quote-info\"><tr><td>最高：70.00</td><td>今开：66.20</td><td>涨停：74.22</td><td>成交量：25405手</td></tr><tr class=\"separateTop\"><td>最低：66.20</td><td>昨收：67.47</td><td>跌停：60.72</td><td>成交额：1.74亿</td></tr><tr class=\"separateBottom\"><td>量比：1.85</td><td>换手：6.20%</td><td>市盈率(动)：70.06</td><td>市盈率(TTM)：83.27</td></tr><tr><td>委比：-16.19%</td><td>振幅：5.63%</td><td>市盈率(静)：90.90</td><td>市净率：14.14</td></tr><tr><td>每股收益：0.83</td><td>股息(TTM)：0.00</td><td>总股本：4.01亿</td><td>总市值：276.21亿</td></tr><tr><td>每股净资产：4.87</td><td>股息率(TTM)：0.00%</td><td>流通股：4100.00万</td><td>流通值：28.24亿</td></tr><tr><td>52周最高：81.50</td><td>52周最低：17.62</td><td>货币单位：CNY</td></tr></table>"

This is the stock information shown on the page, stored as an HTML table.

Use a regular expression to pull out this table:

    tableHtml = re.search(r'"tableHtml":"(.*?)",', response.text).group(1)

group(1) returns only the part captured by the parentheses in "tableHtml":"(.*?)", that is, the table's HTML:

<table class=\"quote-info\"><tr><td>最高：70.00</td><td>今开：66.20</td><td>涨停：74.22</td><td>成交量：25405手</td></tr><tr class=\"separateTop\"><td>最低：66.20</td><td>昨收：67.47</td><td>跌停：60.72</td><td>成交额：1.74亿</td></tr><tr class=\"separateBottom\"><td>量比：1.85</td><td>换手：6.20%</td><td>市盈率(动)：70.06</td><td>市盈率(TTM)：83.27</td></tr><tr><td>委比：-16.19%</td><td>振幅：5.63%</td><td>市盈率(静)：90.90</td><td>市净率：14.14</td></tr><tr><td>每股收益：0.83</td><td>股息(TTM)：0.00</td><td>总股本：4.01亿</td><td>总市值：276.21亿</td></tr><tr><td>每股净资产：4.87</td><td>股息率(TTM)：0.00%</td><td>流通股：4100.00万</td><td>流通值：28.24亿</td></tr><tr><td>52周最高：81.50</td><td>52周最低：17.62</td><td>货币单位：CNY</td></tr></table>

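Next, parse this HTML with the BeautifulSoup library:

    soup = BeautifulSoup(tableHtml, "html.parser")
    table = soup.table

Each piece of stock information sits in a cell like this:
<td>最高：70.00</td>

Loop over the cells, read the text inside each tag, and store it in the dictionary:

    for td in table.find_all("td"):
        # Split on the full-width Chinese colon "：", not the ASCII colon ":"
        key, value = td.text.split("：", 1)
        infoDict.update({key: value})  # store as a key/value pair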
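For experimenting without bs4 installed, the same two steps (cut out tableHtml, then split each cell on the full-width colon) can be sketched with the standard library alone. This swaps BeautifulSoup for a plain regular expression over <td> cells, which is only reasonable because the table is flat and regular; PAGE_TEXT is a shortened fragment modelled on the source excerpt, and parse_cells is a name invented here.

```python
import re

# Shortened fragment in the shape of the page's script block.
# The \" escapes inside the table markup are kept on purpose.
PAGE_TEXT = (
    'quote: {"flagStr":"","tableHtml":"<table class=\\"quote-info\\"><tr>'
    '<td>最高：70.00</td><td>今开：66.20</td></tr><tr class=\\"separateTop\\">'
    '<td>最低：66.20</td><td>货币单位：CNY</td></tr></table>","isMF":false}'
)


def parse_cells(table_html):
    """Turn '<td>key：value</td>' cells into a dict, skipping cells without a colon."""
    info = {}
    for cell in re.findall(r"<td>(.*?)</td>", table_html):
        key, sep, value = cell.partition("：")  # full-width colon
        if sep:
            info[key] = value
    return info


table_html = re.search(r'"tableHtml":"(.*?)",', PAGE_TEXT).group(1)
info = parse_cells(table_html)
print(info)
```

Using partition rather than split avoids an IndexError when a cell has no colon; BeautifulSoup remains the more robust choice if the markup ever nests tags inside the cells.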

Code:

    def parse_stock(self, response):
        infoDict = {}
        try:
            name = re.search(r'<div class="stock-name">(.*?)</div>', response.text).group(1)
            infoDict.update({'股票名称': name})  # the key means "stock name"
            tableHtml = re.search(r'"tableHtml":"(.*?)",', response.text).group(1)
            soup = BeautifulSoup(tableHtml, "html.parser")
            table = soup.table
            for td in table.find_all("td"):
                # Split on the full-width Chinese colon "：", not the ASCII colon ":"
                key, value = td.text.split("：", 1)
                infoDict.update({key: value})
            yield infoDict
        except (AttributeError, ValueError):
            # a pattern did not match, or a cell had no colon
            self.logger.warning("could not parse %s", response.url)

The StockSpider class is changed as follows:

    class StockSpider(scrapy.Spider):
        name = 'stock'
        # allowed_domains = ['quote.eastmoney.com']
        start_urls = ['http://quote.eastmoney.com/stock_list.html']

With the preparation done, write the parse function in stock.py. It processes the Eastmoney stock list page once it has been fetched and then requests the stock information from Xueqiu. The extraction of the stock information itself lives in parse_stock.

    # parse (a method of the StockSpider class)
    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.search(r"[s][hz]\d{6}", href).group(0)
                stock = stock.upper()
                url = 'https://xueqiu.com/S/' + stock
                yield scrapy.Request(url, callback=self.parse_stock)
            except AttributeError:
                # re.search returned None: this link has no stock code
                continue

The function that extracts the stock information, parse_stock:

    # parse_stock (a method of the StockSpider class)
    def parse_stock(self, response):
        infoDict = {}
        try:
            name = re.search(r'<div class="stock-name">(.*?)</div>', response.text).group(1)
            infoDict.update({'股票名称': name})  # the key means "stock name"
            tableHtml = re.search(r'"tableHtml":"(.*?)",', response.text).group(1)
            soup = BeautifulSoup(tableHtml, "html.parser")
            table = soup.table
            for td in table.find_all("td"):
                # Split on the full-width Chinese colon "：", not the ASCII colon ":"
                key, value = td.text.split("：", 1)
                infoDict.update({key: value})
            yield infoDict
        except (AttributeError, ValueError):
            # a pattern did not match, or a cell had no colon
            self.logger.warning("could not parse %s", response.url)

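Once the parsing is written, the data needs to be saved to a file. Scrapy's item pipeline mechanism (ITEM_PIPELINES) is used to write each dictionary to the file.

Open pipelines.py in the demo project.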
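Add a new class here that writes the collected stock information to a file.
The code is:

    class stockPipeline(object):
        def open_spider(self, spider):  # called when the spider starts
            # utf-8 so the Chinese field names are written the same on every platform
            self.f = open('XueQiuStock.txt', 'w', encoding='utf-8')  # open the file

        def close_spider(self, spider):  # called when the spider finishes
            self.f.close()  # close the file

        def process_item(self, item, spider):  # called for every item
            try:
                line = str(dict(item)) + '\n'
                self.f.write(line)
            except (OSError, ValueError):
                spider.logger.warning("could not write item: %r", item)
            return item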
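Because a pipeline is an ordinary class with three hook methods, it can be exercised by hand before running a crawl. The sketch below is a hypothetical variant, named JsonLinesStockPipeline here, that writes one JSON object per line instead of str(dict(item)); the path parameter and the .jl file name are assumptions, and spider=None stands in for the spider object Scrapy would pass.

```python
import json
import os
import tempfile


class JsonLinesStockPipeline:
    """Hypothetical variant of stockPipeline: one JSON object per line."""

    def __init__(self, path="XueQiuStock.jl"):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese keys readable in the file
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Call the hooks in the order Scrapy would: open, process each item, close.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jl")
pipeline = JsonLinesStockPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"股票名称": "三只松鼠(SZ:300783)", "最高": "70.00"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

JSON lines are easier to load back than the repr of a dict, at the cost of differing from the output format used in the rest of this article.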

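After that, edit the project's configuration file settings.py.

Since we are not using the pipeline class generated by the template, the configuration must be changed so that Scrapy can find our stockPipeline and save the file.

Find ITEM_PIPELINES. It is commented out by default, so either uncomment it or add it yourself.

Change 'demo.pipelines.DemoPipeline': 300 to 'demo.pipelines.stockPipeline': 300.

Code:

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'demo.pipelines.stockPipeline': 300,
    }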

Next, change the spider's User-Agent so that it can crawl Xueqiu's pages normally.

USER_AGENT is also commented out by default, so either uncomment it or add it yourself.

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"

With that done, the spider is ready to run.

(PS: the CONCURRENT_REQUESTS setting in settings.py controls how many requests Scrapy performs concurrently; the default is 16.)

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 32

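In the terminal, enter:
scrapy crawl stock
to run the spider.

The output is saved as demo/XueQiuStock.txt, since the file is created in the directory the command is run from (the project folder demo here).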
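Each line of XueQiuStock.txt is the repr of a Python dict, so it can be read back with ast.literal_eval. A minimal sketch follows, assuming the helper name load_stock_lines and a sample line built from the values shown earlier; to read the real file, pass an open file object instead of the list.

```python
import ast


def load_stock_lines(lines):
    """Parse lines written as str(dict(item)) back into dictionaries."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(ast.literal_eval(line))
    return records


# Sample lines in the format the pipeline writes (values from the excerpt above).
SAMPLE_LINES = [
    "{'股票名称': '三只松鼠(SZ:300783)', '最高': '70.00'}\n",
    "\n",
]
records = load_stock_lines(SAMPLE_LINES)
print(records[0]["最高"])

# With the real output file:
# with open("XueQiuStock.txt", encoding="utf-8") as f:
#     records = load_stock_lines(f)
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for this kind of file.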

If the run finishes and no file appears, try:
scrapy crawl stock -o stock.csv
----------------
The help text for this option (run scrapy crawl -h to see it) reads:
Options
--output=FILE, -o FILE dump scraped items into FILE (use - for stdout)
The -o export supports these file formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
----------------
Thanks to weixin_45900412 for the help.


stock.py

    # -*- coding: utf-8 -*-
    import re

    import scrapy
    from bs4 import BeautifulSoup


    class StockSpider(scrapy.Spider):
        name = 'stock'
        # allowed_domains = ['quote.eastmoney.com']
        start_urls = ['http://quote.eastmoney.com/stock_list.html']

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                try:
                    stock = re.search(r"[s][hz]\d{6}", href).group(0)
                    stock = stock.upper()
                    url = 'https://xueqiu.com/S/' + stock
                    yield scrapy.Request(url, callback=self.parse_stock)
                except AttributeError:
                    # re.search returned None: this link has no stock code
                    continue

        def parse_stock(self, response):
            infoDict = {}
            try:
                name = re.search(r'<div class="stock-name">(.*?)</div>', response.text).group(1)
                infoDict.update({'股票名称': name})  # the key means "stock name"
                tableHtml = re.search(r'"tableHtml":"(.*?)",', response.text).group(1)
                soup = BeautifulSoup(tableHtml, "html.parser")
                table = soup.table
                for td in table.find_all("td"):
                    # Split on the full-width Chinese colon "：", not the ASCII colon ":"
                    key, value = td.text.split("：", 1)
                    infoDict.update({key: value})
                yield infoDict
            except (AttributeError, ValueError):
                # a pattern did not match, or a cell had no colon
                self.logger.warning("could not parse %s", response.url)

pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    class DemoPipeline(object):
        def process_item(self, item, spider):
            return item


    class stockPipeline(object):
        def open_spider(self, spider):
            # utf-8 so the Chinese field names are written the same on every platform
            self.f = open('XueQiuStock.txt', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.f.close()

        def process_item(self, item, spider):
            try:
                line = str(dict(item)) + '\n'
                self.f.write(line)
            except (OSError, ValueError):
                spider.logger.warning("could not write item: %r", item)
            return item

settings.py (non-comment lines only)

    # -*- coding: utf-8 -*-
    BOT_NAME = 'demo'

    SPIDER_MODULES = ['demo.spiders']
    NEWSPIDER_MODULE = 'demo.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'demo.pipelines.stockPipeline': 300,
    }


Posted in: Python crawlers

2022-10-25
