Python in Depth: Scraping CSDN Site-Wide Hot List Title Keywords with the Scrapy Framework


Posted in Python on November 11, 2021

Preface

Following up on my previous article: Python in Depth: Scraping and Counting Keyword Frequencies in CSDN Site-Wide Hot List Titles

I have reimplemented the same thing on top of Scrapy. The underlying principle for fetching the page source is identical, but the Scrapy architecture is more systematic. Below I also point out the issues to watch for.

GitHub repository address: see the project link.

Environment Setup

Install Scrapy

pip install scrapy -i https://pypi.douban.com/simple

Install Selenium

pip install selenium -i https://pypi.douban.com/simple

Install jieba

pip install jieba -i https://pypi.douban.com/simple

IDE:PyCharm

Google Chrome Driver, matching your browser version: see the chromedriver download page.

Check your browser version and download the corresponding driver version.


Implementation

Let's get started.

Create the Project

Use the scrapy command to create our project.

scrapy startproject csdn_hot_words

The project structure follows the official layout.
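For reference, `scrapy startproject` generates a layout along these lines (exact files can vary slightly with the Scrapy version):

```
csdn_hot_words/
├── scrapy.cfg
└── csdn_hot_words/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```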


Define the Item

Following the logic from before, the main field is a dict mapping title keywords to their occurrence counts. The code:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
 
import scrapy
 
 
class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()

Keyword Extraction Tool

Use jieba word segmentation to extract the keywords.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : analyse_sentence.py
 
import jieba.analyse
 
 
def get_key_word(sentence):
    """Return a dict mapping each extracted keyword to its occurrence count."""
    result_dic = {}
    # with withWeight=True, extract_tags returns (word, weight) pairs
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, weight in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
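The counting loop can be exercised without installing jieba by stubbing the return value of `extract_tags`; the tag list below is made up purely for illustration, but it has the same `(word, weight)` tuple shape that `withWeight=True` produces:

```python
# Stand-in for jieba.analyse.extract_tags(sentence, topK=3, withWeight=True):
# a list of (word, weight) pairs. Words and weights here are invented.
stub_tags = [('Python', 11.7), ('Scrapy', 9.2), ('CSDN', 8.5)]

result_dic = {}
for word, weight in stub_tags:
    # each keyword extracted from one title is counted once
    result_dic[word] = result_dic.get(word, 0) + 1

print(result_dic)  # {'Python': 1, 'Scrapy': 1, 'CSDN': 1}
```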

Build the Spider

Here the spider is initialized with a browser instance, which is used to render dynamically loaded pages.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : csdn.py
 
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word
 
 
class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']
 
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # run Chrome in headless mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        # point executable_path at your local chromedriver binary
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)
 
    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item

Code Notes

1. Chrome runs in headless mode here, so no browser window needs to open for the visits; everything executes in the background.

2. You need to supply the path to the chromedriver executable.

3. In the parse method, you can reuse the XPath from my previous article to grab the titles, call the keyword extractor, and construct the item objects.
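To illustrate the shape of that XPath query in isolation, here is a minimal sketch run against a hand-made HTML fragment (the `hosetitem-title` class matches the spider's XPath, but the real page markup may differ; stdlib ElementTree stands in for Scrapy's selector, which supports richer XPath):

```python
import xml.etree.ElementTree as ET

# hand-made fragment mimicking the hot-list markup the spider queries
html = """
<div>
  <div class="hosetitem-title"><a>Java yyds</a></div>
  <div class="hosetitem-title"><a>2021 review</a></div>
</div>
"""

root = ET.fromstring(html)
# same structure as //div[@class='hosetitem-title']/a/text() in the spider
titles = [a.text for a in root.findall(".//div[@class='hosetitem-title']/a")]
print(titles)  # ['Java yyds', '2021 review']
```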

Build the Middleware

Add the JavaScript to be executed. Full middleware code:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time
 
from selenium.webdriver.chrome.options import Options
 
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
 
 
class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
 
        # Should return None or raise an exception.
        return None
 
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
 
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i
 
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
 
        # Should return either None or an iterable of Request or item objects.
        pass
 
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
 
        # Must return only requests (not items).
        for r in start_requests:
            yield r
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
 
 
class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_request(self, request, spider):
        # scroll the page in 500px steps every 500ms for up to 20s,
        # so lazily-loaded list items get rendered
        js = '''
            let height = 0
            let interval = setInterval(() => {
                window.scrollTo({
                    top: height,
                    behavior: "smooth"
                });
                height += 500
            }, 500);
            setTimeout(() => {
                clearInterval(interval)
            }, 20000);
        '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)  # wait for the scrolling to finish
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('Timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            # note: this closes the browser window after the first request,
            # which is fine here because the spider only visits one URL
            spider.browser.close()
 
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
 
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
 
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
 
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Create a Custom Pipeline

Aggregate the final results by word frequency and write them to a file. The code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
 
 
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
 
 
class CsdnHotWordsPipeline:
 
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []
 
    def process_item(self, item, spider):
        self.all_words.append(item)
        return item
 
    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(),
                                 key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()
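The aggregation step in `close_spider` can be sketched on its own with stdlib `collections.Counter` (the sample items below are made up): per-title keyword dicts are merged case-insensitively, then sorted by total count.

```python
from collections import Counter

# two sample items in the same shape the spider yields (made-up data)
all_words = [
    {'words': {'Java': 1, 'Python': 1}},
    {'words': {'java': 1, '2021': 1}},
]

key_word_dic = Counter()
for item in all_words:
    for k, v in item['words'].items():
        key_word_dic[k.lower()] += v  # lowercase so 'Java' and 'java' merge

word_count_sort = sorted(key_word_dic.items(), key=lambda x: x[1], reverse=True)
print(word_count_sort)  # [('java', 2), ('python', 1), ('2021', 1)]
```

Using `Counter` removes the explicit if/else branch in the pipeline; the sort is stable, so words with equal counts keep their first-seen order.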

Settings Configuration

A few adjustments are needed in the settings, as follows:

# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
BOT_NAME = 'csdn_hot_words'
 
SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
 
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
 
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
 
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
 
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
 
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}
 
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}
 
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}
 
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }
 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}
 
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
 
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run the Main Program

You could run it directly with the scrapy command, but for easier log viewing I added a main program.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 22:41
# @Author  : 至尊宝
# @Site    : 
# @File    : main.py
from scrapy import cmdline
 
cmdline.execute('scrapy crawl csdn'.split())

Execution Results

Partial execution log:


This produces the result.txt output.
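Since the pipeline writes one `word,count` pair per line, the results can be loaded back for further analysis with a few lines of stdlib Python (the sample string below stands in for the real file contents, which are not reproduced here):

```python
# stand-in for open('result.txt', encoding='utf-8').read(); counts are invented
sample = "java,27\npython,19\n2021,11\n"

top = []
for line in sample.splitlines():
    word, count = line.rsplit(',', 1)  # rsplit in case a keyword contains a comma
    top.append((word, int(count)))

print(top[:2])  # [('java', 27), ('python', 19)]
```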


Summary

See, Java is still the GOAT. I have no idea why "2021" also ranks so high as a keyword, so I figured I should add 2021 to my own title too.

Posting the GitHub project address once more: see the project link.

To be clear, the example in this article is for research and exploration only, not for malicious attacks.

A quote to share:

Ordinary people who won't put in painstaking, back-breaking effort to see something through have no right at all to talk about talent.

— 烽火戏诸侯, 《剑来》 (Sword of Coming)

