Garbled Chinese Output in Scrapy Crawlers


Problem description:

I.

# This is the .csv output file; the Chinese text is garbled.

[root@Uu jianshu]# cat jianshu.csv
url,title,author
http://www.jianshu.com/p/2a7a594816e1,彖浣犳 村?鏍?
[root@Uu jianshu]# 璋㈣传绌凤兼娉绗锛?

II.

# This is the .json output file; its Chinese text is not readable either (it appears as \uXXXX escapes).

[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86"], "author": ["\u65e0\u6212"]}
]
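The \uXXXX sequences in the JSON file are ordinary JSON string escapes, not corrupted data: any JSON parser turns them back into the original characters. The following is a minimal Python 3 sketch (the sample string is a shortened copy of the item above, and the variable names are illustrative only) showing the round trip and the `ensure_ascii` switch of the standard `json` module that controls whether non-ASCII characters are escaped.

```python
import json

# A shortened copy of the exported item, with the escapes exactly as stored on disk.
escaped = r'{"author": ["\u65e0\u6212"]}'

# Parsing restores the real characters.
item = json.loads(escaped)

# Default serialization escapes every non-ASCII character again.
ascii_out = json.dumps(item)

# With ensure_ascii=False the characters are written as-is.
readable_out = json.dumps(item, ensure_ascii=False)

print(ascii_out)
print(readable_out)
```

So the escaped file carries the correct data; it is only inconvenient to read by eye.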

Troubleshooting process:

I.  The first guess is UTF-8. The process is as follows:

# Trying UTF-8:

[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

USER_AGENT = USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URL = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
#FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# ll
total 8
drwxr-xr-x. 3 root root 174 Aug 28 22:35 jianshu
-rw-r--r--. 1 root root 117 Aug 28 22:34 jianshu.json
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# rm -f jianshu.json
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:35:51 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'UTF-8', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:35:51 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:35:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:35:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'], 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'], 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:35:57 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:35:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10881,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 35, 57, 854597),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42971136,
 'memusage/startup': 42971136,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 35, 51, 387501)}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["彖浣犳 村?], "author": ["鏍?]}
璋㈣传绌凤兼娉绗锛? [root@Uu jianshu]#

As shown, setting FEED_EXPORT_ENCODING = 'UTF-8' in settings.py does not fix the garbled display when the file is viewed in this terminal.
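One plausible reading of this result, offered as an assumption rather than something the session verifies: the exported file is valid UTF-8, but the terminal (or SSH client) decodes bytes as GBK, so `cat` renders mojibake. A minimal Python 3 sketch of that effect, using the author name from the scraped item (variable names are illustrative):

```python
# The author name from the scraped item.
text = "无戒"

# Bytes as a UTF-8 export would store them.
utf8_bytes = text.encode("utf-8")

# What a viewer that assumes GBK would show for those same bytes
# (assumption: this mimics the terminal in the session above).
seen_as_gbk = utf8_bytes.decode("gbk", errors="replace")

# Bytes as a GBK export would store them, viewed by the same GBK viewer.
gbk_bytes = text.encode("gbk")
seen_correctly = gbk_bytes.decode("gbk")

print(repr(seen_as_gbk))     # not the original text
print(repr(seen_correctly))  # the original text
```

If that assumption holds, the UTF-8 file itself is intact, and switching the terminal's character encoding to UTF-8 would be an alternative to changing the export encoding.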

II. Next, try GBK (that is, set FEED_EXPORT_ENCODING = 'GBK'). The process is as follows:

# Trying GBK:

[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:32:40 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GBK', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:32:40 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:32:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:32:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:32:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:32:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'], 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'], 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:32:46 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:32:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10879,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 32, 46, 587323),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42975232,
 'memusage/startup': 42975232,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 32, 40, 291948)}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}

Clearly, the problem is solved: the Chinese text is now displayed correctly.

III. Now try GB2312 as well (that is, set FEED_EXPORT_ENCODING = 'GB2312'). The process is as follows:

# Trying GB2312:

[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

USER_AGENT = USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.JianshuPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URL = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
#FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# rm -f jianshu.json
[root@Uu jianshu]# ll
total 4
drwxr-xr-x. 3 root root 174 Aug 28 22:44 jianshu
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:45:25 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GB2312', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:45:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:45:26 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:45:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:45:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:45:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:45:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:45:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'], 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'], 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:45:32 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:45:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10873,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 45, 32, 543578),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42971136,
 'memusage/startup': 42971136,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 45, 26, 27174)}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}
][root@Uu jianshu]#

As can be seen, the problem is solved here as well, so GB2312 also fixes the garbled Chinese output.
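A caveat that the session above does not exercise: GB2312 covers fewer characters than GBK, so text containing a character outside GB2312 cannot be encoded with it, while GBK may still succeed. The Python 3 sketch below checks this with a hypothetical helper, `can_encode`; the character 镕 is chosen here only as an example of a character that GBK has and GB2312 lacks.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The title scraped in the session above: fine in both encodings.
title = "听说你感谢贫穷,我想笑,却哭了"

# Example character assumed for illustration: present in GBK, absent from GB2312.
rare = "镕"

for enc in ("gb2312", "gbk", "utf-8"):
    print(enc, can_encode(title, enc), can_encode(rare, enc))
```

For feeds whose content is not known in advance, GBK is therefore the less fragile of the two choices tested here.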

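Summary:

The experiments show that garbled Chinese in Scrapy feed exports can be controlled with a single parameter in settings.py, FEED_EXPORT_ENCODING.

In this environment, where the exported file was checked with cat in the terminal, only GBK or GB2312 produced readable Chinese; setting it to UTF-8 did not.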
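To check an exported feed without depending on how a terminal renders bytes, the file can be decoded explicitly in Python. The sketch below is a Python 3 illustration under stated assumptions: `write_feed` is a hypothetical stand-in for the exporter (it is not Scrapy code), `read_feed` is a hypothetical helper, and the files go to a temporary directory.

```python
import io
import json
import os
import tempfile


def write_feed(path, items, encoding):
    """Stand-in for the exporter: write items as JSON text in the given encoding."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(json.dumps(items, ensure_ascii=False))


def read_feed(path, encoding):
    """Decode the file with an explicit encoding and parse it as JSON."""
    with io.open(path, "r", encoding=encoding) as f:
        return json.load(f)


items = [{
    "url": "http://www.jianshu.com/p/2a7a594816e1",
    "title": ["听说你感谢贫穷,我想笑,却哭了"],
    "author": ["无戒"],
}]

# Write the same items in each encoding tried above, then read them back.
results = {}
tmp_dir = tempfile.mkdtemp()
for enc in ("utf-8", "gbk", "gb2312"):
    path = os.path.join(tmp_dir, "jianshu-%s.json" % enc)
    write_feed(path, items, enc)
    results[enc] = read_feed(path, enc)

for enc, loaded in results.items():
    print(enc, loaded == items)
```

When the file is decoded with the encoding it was written in, all three variants carry the same data; they differ only in which viewers display them correctly.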

 

 

 

 

 


Copyright notice: published under Python教程 (Python Tutorials) on 2022-10-25; 19,655 characters in total.