
How to Implement a Multithreaded Crawler in Python Using Queues

Posted in Python on May 12, 2020

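Description: this crawler scrapes jokes from Qiushibaike (糗事百科) using queues and multiple threads. The key calls are Queue.task_done() and Queue.join(). Every item taken from a queue is marked as finished with task_done(), and join() blocks the main thread until all items put on that queue have been marked finished, so the program exits only after each stage has completed its work.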
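As a minimal, network-free sketch of how Queue.task_done() and Queue.join() cooperate (the names worker and square_all are illustrative and not part of the crawler):

```python
import threading
from queue import Queue


def worker(q, results):
    """Consume numbers from q forever and store their squares in results."""
    while True:
        n = q.get()            # blocks until an item is available
        results.append(n * n)  # do the work for this item
        q.task_done()          # mark this item as fully processed


def square_all(numbers, n_workers=3):
    q = Queue()
    results = []
    for _ in range(n_workers):
        # Daemon threads are abandoned when the main thread exits
        threading.Thread(target=worker, args=(q, results), daemon=True).start()
    for n in numbers:
        q.put(n)
    # join() returns once task_done() has been called for every put()
    q.join()
    return sorted(results)


if __name__ == "__main__":
    print(square_all(range(5)))  # [0, 1, 4, 9, 16]
```

The workers never leave their loop. The function still returns because join() only waits for the count of unfinished items to reach zero, not for the threads themselves to end.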

The code is as follows:

import requests
from lxml import etree
import json
from queue import Queue
import threading

class Qsbk(object):
  def __init__(self):
    self.headers = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
      "Referer": "https://www.qiushibaike.com/"
    }
    # Create three queues, one per stage: URLs, parsed HTML, extracted items
    self.url_queue = Queue()
    self.html_queue = Queue()
    self.content_queue = Queue()

  def get_total_url(self):
    """
    Generate the URL of every page and put each one into url_queue.
    """
    url_temp = "https://www.qiushibaike.com/text/page/{}/"
    for i in range(1,13):
      # Put each generated URL into the url_queue
      self.url_queue.put(url_temp.format(i))

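  def parse_url(self):
    """
    Send the request, get the response, and parse the HTML with etree.
    """
    # Loop forever: get() blocks while the queue is empty, and this daemon
    # thread ends when the main thread exits. (Queue.not_empty is a Condition
    # object that is always truthy, so it cannot serve as an emptiness test.)
    while True:
      # Take one URL from the queue
      url = self.url_queue.get()
      print("parsing url:",url)
      # Send the request
      response = requests.get(url,headers=self.headers,timeout=10)
      # Decode the response body into an HTML string
      html = response.content.decode()
      # Parse the string into an lxml element tree
      html = etree.HTML(html)
      # Put the parsed element into the html_queue
      self.html_queue.put(html)
      # task_done() tells the queue that the item just taken has been fully processed
      self.url_queue.task_done()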

  def get_content(self):
    """
    Parse the page and extract the fields we want.
    """
    # Loop forever; get() blocks until a parsed page is available
    while True:
      items = list()
      html = self.html_queue.get()
      total_div = html.xpath("//div[@class='col1 old-style-col1']/div")
      for i in total_div:

        author_img = i.xpath(".//a[@rel='nofollow']/img/@src")
        # Assumes the src attribute is protocol-relative (starts with "//")
        author_img = "https:"+author_img[0] if len(author_img)>0 else None

        author_name = i.xpath(".//a[@rel='nofollow']/img/@alt")
        author_name = author_name[0] if len(author_name)>0 else None

        author_href = i.xpath("./a/@href")
        author_href = "https://www.qiushibaike.com/"+author_href[0] if len(author_href)>0 else None

        author_gender = i.xpath("./div[1]/div/@class")
        author_gender = author_gender[0].split(" ")[-1].replace("Icon","").strip() if len(author_gender)>0 else None

        author_age = i.xpath("./div[1]/div/text()")
        author_age = author_age[0] if len(author_age)>0 else None

        content = i.xpath("./a/div/span/text()")
        content = content[0].strip() if len(content)>0 else None

        content_vote = i.xpath("./div[@class='stats']/span[@class='stats-vote']/i/text()")
        content_vote = content_vote[0] if len(content_vote)>0 else None

        content_comment_numbers = i.xpath("./div[@class='stats']/span[@class='stats-comments']/a/i/text()")
        content_comment_numbers = content_comment_numbers[0] if len(content_comment_numbers)>0 else None

        item = {
          "author_name":author_name,
          "author_age" :author_age,
          "author_gender":author_gender,
          "author_img":author_img,
          "author_href":author_href,
          "content":content,
          "content_vote":content_vote,
          "content_comment_numbers":content_comment_numbers,
        }
        items.append(item)
      self.content_queue.put(items)
      # task_done() decrements the queue's count of unfinished tasks
      self.html_queue.task_done()

  def save_items(self):
    """
    Save the items to a file.
    """
    # Loop forever; get() blocks until a batch of items is available
    while True:
      items = self.content_queue.get()
      with open("quishibaike.txt",'a',encoding='utf-8') as f:
        for i in items:
          json.dump(i,f,ensure_ascii=False,indent=2)
          # Separate consecutive JSON objects with a newline
          f.write("\n")
      self.content_queue.task_done()

  def run(self):
    # Generate the URLs
    thread_list = list()
    thread_url = threading.Thread(target=self.get_total_url)
    thread_list.append(thread_url)

    # Send the network requests
    for i in range(10):
      thread_parse = threading.Thread(target=self.parse_url)
      thread_list.append(thread_parse)

    # Extract the data
    thread_get_content = threading.Thread(target=self.get_content)
    thread_list.append(thread_get_content)

    # Save the results
    thread_save = threading.Thread(target=self.save_items)
    thread_list.append(thread_save)

    for t in thread_list:
      # Mark each thread as a daemon so that it stops when the main thread exits
      t.daemon = True
      t.start()

    # Wait until every URL has been queued; otherwise url_queue.join()
    # could return immediately on a still-empty queue
    thread_url.join()

    # Block the main thread until every item in each queue has been marked done
    self.url_queue.join()
    self.html_queue.join()
    self.content_queue.join()


if __name__=="__main__":
  obj = Qsbk()
  obj.run()
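
One caveat with this pattern: if a worker raises an exception before it reaches task_done(), the queue's unfinished count never drops to zero and join() blocks forever. A minimal sketch of one way to guard against that, using try/finally. The names fetch, safe_worker, and crawl are hypothetical, and fetch is a stand-in that simulates a failing request rather than calling requests.get():

```python
import threading
from queue import Queue


def fetch(url):
    """Hypothetical stand-in for requests.get(); fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("simulated network error")
    return "<html>" + url + "</html>"


def safe_worker(url_queue, html_queue, errors):
    while True:
        url = url_queue.get()
        try:
            html_queue.put(fetch(url))
        except Exception as exc:
            # Record the failure and keep the worker alive
            errors.append((url, str(exc)))
        finally:
            # Always runs, so join() cannot hang on a failed item
            url_queue.task_done()


def crawl(urls, n_workers=2):
    url_queue, html_queue, errors = Queue(), Queue(), []
    for _ in range(n_workers):
        threading.Thread(
            target=safe_worker,
            args=(url_queue, html_queue, errors),
            daemon=True,
        ).start()
    for u in urls:
        url_queue.put(u)
    url_queue.join()
    # Every put() into html_queue happened before the matching task_done(),
    # so draining the queue here sees all fetched pages
    pages = []
    while not html_queue.empty():
        pages.append(html_queue.get())
    return sorted(pages), errors


if __name__ == "__main__":
    print(crawl(["a", "bad", "b"]))
```

The same try/finally shape could be applied to each stage of the crawler so that a single failed request or malformed page does not stall the whole program.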

That is all for this article. I hope it helps with your studies, and thank you for supporting 三水点靠木.

- Author -

Norni
