Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程


Posted in Python onNovember 11, 2021

前言

接着我的上一篇:Python 详解爬取并统计CSDN全站热榜标题关键词词频流程

我换成Scrapy架构也实现了一遍。获取页面源码底层原理是一样的,Scrapy架构更系统一些。下面我会把需要注意的问题,也说明一下。

提供一下GitHub仓库地址:github本项目地址

环境部署

scrapy安装

pip install scrapy -i https://pypi.douban.com/simple

selenium安装

pip install selenium -i https://pypi.douban.com/simple

jieba安装

pip install jieba -i https://pypi.douban.com/simple

IDE:PyCharm

google chrome driver下载对应版本:google chrome driver下载地址

检查浏览器版本,下载对应版本。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

实现过程

下面开始搞起。

创建项目

使用scrapy命令创建我们的项目。

scrapy startproject csdn_hot_words

项目结构,如同官方给出的结构。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

定义Item实体

按照之前的逻辑,主要属性为标题关键词对应出现次数的字典。代码如下:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
 
import scrapy
 
 
class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()

关键词提取工具

使用jieba分词获取工具。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : analyse_sentence.py
 
import jieba.analyse
 
 
def get_key_word(sentence):
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, flag in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic

爬虫构造

这里需要给爬虫初始化一个浏览器参数,用来实现页面的动态加载。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : csdn.py
 
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word
 
 
class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']
 
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # 使用无头谷歌浏览器模式
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)
 
    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item

代码说明

1、这里使用的是chrome的无头模式,就不需要有个浏览器打开再访问,都是后台执行的。

2、需要添加chromedriver的执行文件地址。

3、在parse的部分,可以参考之前我文章的xpath,获取到标题并且调用关键词提取,构造item对象。

中间件代码构造

添加js代码执行内容。中间件完整代码:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time
 
from selenium.webdriver.chrome.options import Options
 
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
 
 
class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
 
        # Should return None or raise an exception.
        return None
 
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
 
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i
 
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
 
        # Should return either None or an iterable of Request or item objects.
        pass
 
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
 
        # Must return only requests (not items).
        for r in start_requests:
            yield r
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
 
 
class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_request(self, request, spider):
        js = '''
                        let height = 0
                let interval = setInterval(() => {
                    window.scrollTo({
                        top: height,
                        behavior: "smooth"
                    });
                    height += 500
                }, 500);
                setTimeout(() => {
                    clearInterval(interval)
                }, 20000);
            '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('超时异常:{}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()
 
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
 
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
 
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
 
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

制作自定义pipeline

定义按照词频统计最终结果输出到文件。代码如下:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
 
 
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
 
 
class CsdnHotWordsPipeline:
 
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []
 
    def process_item(self, item, spider):
        self.all_words.append(item)
        return item
 
    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(),
                                 key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()

settings配置

配置上要做一些调整。如下调整:

# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
BOT_NAME = 'csdn_hot_words'
 
SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
 
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
 
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
 
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
 
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
 
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}
 
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}
 
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}
 
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }
 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}
 
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
 
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

执行主程序

可以通过scrapy的命令执行,但是为了看日志方便,加了一个主程序代码。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 22:41
# @Author  : 至尊宝
# @Site    : 
# @File    : main.py
from scrapy import cmdline
 
cmdline.execute('scrapy crawl csdn'.split())

执行结果

执行部分日志

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

得到result.txt结果。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

总结

看,java还是yyds。不知道为什么2021这个关键词也可以排名靠前。于是我觉着把我标题也加上2021。

GitHub项目地址在发一遍:github本项目地址

申明一下,本文案例仅研究探索使用,不是为了恶意攻击。

分享:

凡夫俗子不下苦功夫、死力气去努力做成一件事,根本就没资格去谈什么天赋不天赋。

——烽火戏诸侯《剑来》

如果本文对你有用的话,请不要吝啬你的赞,谢谢。

以上就是Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程的详细内容,更多关于Python Scrapy框架的资料请关注三水点靠木其它相关文章!

Python 相关文章推荐
详细探究Python中的字典容器
Apr 14 Python
详解python3百度指数抓取实例
Dec 12 Python
python实现学生信息管理系统
Apr 05 Python
matplotlib subplots 调整子图间矩的实例
May 25 Python
浅谈Python3中strip()、lstrip()、rstrip()用法详解
Apr 29 Python
python使用pygame实现笑脸乒乓球弹珠球游戏
Nov 25 Python
基于python实现微信好友数据分析(简单)
Feb 16 Python
python自动点赞功能的实现思路
Feb 26 Python
python如何编写类似nmap的扫描工具
Nov 06 Python
python爬虫多次请求超时的几种重试方法(6种)
Dec 01 Python
python 视频下载神器(you-get)的具体使用
Jan 06 Python
Python中生成ndarray实例讲解
Feb 22 Python
Python 多线程处理任务实例
Nov 07 #Python
python利用while求100内的整数和方式
Nov 07 #Python
python中if和elif的区别介绍
Nov 07 #Python
python中取整数的几种方法
Python 中的 copy()和deepcopy()
Nov 07 #Python
Python MNIST手写体识别详解与试练
Python基础 括号()[]{}的详解
Nov 07 #Python
You might like
简体中文转换为繁体中文的PHP函数
2006/10/09 PHP
修改ThinkPHP缓存为Memcache的方法
2014/06/25 PHP
PHP filter_var() 函数, 验证判断EMAIL,URL等
2021/03/09 PHP
弹出层之1:JQuery.Boxy (一) 使用介绍
2011/10/06 Javascript
jquery中的常见问题及快速解决方法小结
2016/06/14 Javascript
AngularJS延迟加载html template
2016/07/27 Javascript
老生常谈JavaScript 正则表达式语法
2016/08/20 Javascript
JS点击某个图标或按钮弹出文件选择框的实现代码
2016/09/27 Javascript
JS实现给对象动态添加属性的方法
2017/01/05 Javascript
vue父组件向子组件(props)传递数据的方法
2018/01/02 Javascript
用react-redux实现react组件之间数据共享的方法
2018/06/08 Javascript
layui table设置前台过滤转义等方法
2018/08/17 Javascript
JavaScript原型对象原理与应用分析
2018/12/27 Javascript
微信小程序实现上传word、txt、Excel、PPT等文件功能
2019/05/23 Javascript
React 实现车牌键盘的示例代码
2019/12/20 Javascript
解决pycharm双击但是无法打开的情况
2020/10/31 Javascript
JavaScript实现原型封装轮播图
2020/12/27 Javascript
JS实现百度搜索框
2021/02/25 Javascript
[01:23]2014DOTA2国际邀请赛 球迷无处不在Ti现场世界杯受关注
2014/07/10 DOTA
Python下singleton模式的实现方法
2014/07/16 Python
Python复数属性和方法运算操作示例
2017/07/21 Python
Python使用numpy实现BP神经网络
2018/03/10 Python
Python Cookie 读取和保存方法
2018/12/28 Python
Python使用Pandas读写Excel实例解析
2019/11/19 Python
英国最大的香水商店:The Fragrance Shop
2018/07/06 全球购物
医学生自荐信范文
2013/12/03 职场文书
小学英语教学反思案例
2014/02/04 职场文书
校运会入场式解说词
2014/02/10 职场文书
租房协议书
2014/09/12 职场文书
申报优秀教师材料
2014/12/16 职场文书
心灵点滴观后感
2015/06/02 职场文书
2015年电气技术员工作总结
2015/07/24 职场文书
防溺水主题班会教案
2015/08/12 职场文书
2019学生会干事辞职信
2019/06/27 职场文书
Python帮你解决手机qq微信内存占用太多问题
2022/02/15 Python
vue2的 router在使用过程中遇到的一些问题
2022/04/13 Vue.js