Implementing a Multithreaded Web Crawler in Python

Posted in Python on September 06, 2015

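Generally there are two ways to use threads. One is to write the function the thread should run and pass it to a Thread object, which executes it. The other is to inherit directly from Thread, creating a new class that holds the code the thread runs.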
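The two modes can be compared side by side. The following is a minimal sketch, not taken from the crawler itself; the names `work`, `Worker`, `results` and `results_lock` are made up for illustration.

```python
import threading

results = []                      # shared list both threads write to
results_lock = threading.Lock()   # guards `results`


def work(name):
    """Mode 1: a plain function handed to Thread through target=."""
    with results_lock:
        results.append("function:" + name)


class Worker(threading.Thread):
    """Mode 2: subclass Thread and put the work in run()."""

    def __init__(self, name_tag):
        super().__init__()
        self.name_tag = name_tag

    def run(self):
        with results_lock:
            results.append("subclass:" + self.name_tag)


t1 = threading.Thread(target=work, args=("a",))
t2 = Worker("b")
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(sorted(results))  # ['function:a', 'subclass:b']
```

The subclass form is convenient when each thread carries its own state (here `name_tag`), which is likely why the crawler below uses it for its download threads.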

The crawler in this article uses multiple threads together with a lock, and visits pages in breadth-first order.
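To see why the lock matters, consider several threads recording URLs in one shared list. This is a small sketch under assumed names (`record`, `seen_urls`, `total` and the example.com URLs are hypothetical); it only illustrates the locking pattern, not the crawler's actual bookkeeping.

```python
import threading

lock = threading.Lock()
seen_urls = []   # shared by every worker thread
total = 0        # how many distinct URLs were recorded


def record(url):
    """Add url to seen_urls once, even if many threads report it."""
    global total
    # The membership check and the append must happen as one step;
    # without the lock two threads could both pass the check.
    with lock:
        if url not in seen_urls:
            seen_urls.append(url)
            total += 1


threads = [
    threading.Thread(target=record, args=("http://example.com/%d" % (n % 5),))
    for n in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total, len(seen_urls))  # 5 5
```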

First, a brief outline of the approach.

A web crawler that downloads pages in breadth-first order works like this:

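1. Download the first page from the given entry URL.

2. Extract every new page address from that first page and put it in the download list.

3. Download all the new pages whose addresses are in the download list.

4. From all the new pages, find the addresses that have not been downloaded yet and update the download list.

5. Repeat steps 3 and 4 until the updated download list is empty.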
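The steps above can be sketched without any network access. In this illustration the function `crawl_breadth_first`, the callback `fetch_links` and the `FAKE_SITE` link map are all assumptions standing in for real downloading and link extraction.

```python
def crawl_breadth_first(entry, fetch_links):
    """Return the URLs visited, grouped by depth."""
    downloaded = set()
    queue = [entry]                 # the "download list"
    levels = []
    while queue:                    # step 5: stop once the list is empty
        found = []
        for url in queue:           # steps 1 and 3: download everything queued
            downloaded.add(url)
            found.extend(fetch_links(url))  # step 2: extract addresses
        levels.append(queue)
        # step 4: keep only addresses not downloaded yet, without duplicates
        queue = sorted(set(found) - downloaded)
    return levels


# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "http://a": ["http://b", "http://c"],
    "http://b": ["http://c", "http://d"],
    "http://c": ["http://a"],
    "http://d": [],
}

levels = crawl_breadth_first("http://a", lambda u: FAKE_SITE.get(u, []))
print(levels)
# [['http://a'], ['http://b', 'http://c'], ['http://d']]
```

Each inner list is one depth level, which corresponds to one pass of the crawler's main loop.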

The Python code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import threading
import urllib.request

g_mutex = threading.Lock()  # guards the shared lists below
g_pages = []       # downloaded page bodies, from which URLs are parsed
g_queueURL = []    # URLs waiting to be crawled
g_existURL = []    # URLs already crawled
g_failedURL = []   # URLs that failed to download
g_totalcount = 0   # number of pages downloaded so far


class Crawler:
    def __init__(self, crawlername, url, threadnum):
        self.crawlername = crawlername
        self.url = url
        self.threadnum = threadnum
        self.threadpool = []
        self.logfile = open("log.txt", "w", encoding="utf-8")

    def craw(self):
        global g_queueURL
        g_queueURL.append(self.url)  # seed the queue with the entry URL
        depth = 0
        print(self.crawlername + " starting...")
        while len(g_queueURL) != 0:
            depth += 1
            print("Searching depth", depth, "...\n")
            self.logfile.write("URL:" + g_queueURL[0] + "........")
            self.downloadAll()
            self.updateQueueURL()
            self.logfile.write("\n>>>Depth " + str(depth) + ":\n")
            for i, link in enumerate(g_queueURL):
                self.logfile.write(str(g_totalcount + i) + "->" + link + "\n")
        self.logfile.close()

    def downloadAll(self):
        global g_queueURL
        global g_totalcount
        i = 0
        while i < len(g_queueURL):
            # Start at most `threadnum` download threads per batch.
            j = 0
            while j < self.threadnum and i + j < len(g_queueURL):
                g_totalcount += 1
                self.download(g_queueURL[i + j], str(g_totalcount) + ".html", j)
                print("Thread started:", i + j, "--File number =", g_totalcount)
                j += 1
            i += j
            # Wait up to 30 seconds for each thread in the batch.
            for thread in self.threadpool:
                thread.join(30)
            self.threadpool = []
        g_queueURL = []

    def download(self, url, filename, tid):
        crawthread = CrawlerThread(url, filename, tid)
        self.threadpool.append(crawthread)
        crawthread.start()

    def updateQueueURL(self):
        global g_queueURL
        # Snapshot the shared lists under the lock: a thread whose join()
        # timed out may still be appending to them.
        with g_mutex:
            pages = list(g_pages)
            del g_pages[:]  # each page only needs to be parsed once
            exist = set(g_existURL)
        newUrlList = []
        for content in pages:
            newUrlList += self.getUrl(content)
        g_queueURL = list(set(newUrlList) - exist)

    def getUrl(self, content):
        # Collect every double-quoted string that starts with http://
        reg = r'"(http://.+?)"'
        regob = re.compile(reg, re.DOTALL)
        return regob.findall(content)


class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.url = url
        self.filename = filename
        self.tid = tid

    def run(self):
        try:
            with urllib.request.urlopen(self.url) as page:
                html = page.read()
            with open(self.filename, "wb") as fout:
                fout.write(html)
        except Exception as e:
            # Mark the URL as seen so it is not queued again.
            with g_mutex:
                g_existURL.append(self.url)
                g_failedURL.append(self.url)
            print("Failed downloading and saving", self.url)
            print(e)
            return
        with g_mutex:
            # Decode leniently so the regex in getUrl can work on text.
            g_pages.append(html.decode("utf-8", errors="ignore"))
            g_existURL.append(self.url)


if __name__ == "__main__":
    url = input("Enter the entry URL:\n")
    threadnum = int(input("Number of threads: "))
    crawlername = "Little Crawler"
    crawler = Crawler(crawlername, url, threadnum)
    crawler.craw()

That is the Python multithreaded web crawler I wanted to share. I hope you find it useful.

- Author -

糖拌咸鱼
