Python Blog Article Crawler Example


Posted in Python on February 26, 2014
#!/usr/bin/python
# -*- coding: utf-8 -*-
# JCrawler
# Author: Jam <810441377@qq.com>
# Note: written for Python 2 (urllib2, print statements); requires BeautifulSoup 4 (bs4).
import time
import urllib2
from bs4 import BeautifulSoup
# Target site
TargetHost = "http://adirectory.blog.com"
# User Agent
UserAgent  = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36'
# Link extraction rules: each list is a chain of find/findAll steps applied in order
# Category link extraction rules
CategoryFind    = [{'findMode':'find','findTag':'div','rule':{'id':'cat-nav'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# Article link extraction rules
ArticleListFind = [{'findMode':'find','findTag':'div','rule':{'id':'content'}},
                   {'findMode':'findAll','findTag':'h2','rule':{'class':'title'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# Pagination URL rule: "#page" is replaced with the page number
PageUrl  = 'page/#page/'
PageStart = 1
PageStep  = 1
PageStopHtml = '404: Page Not Found'

def GetHtmlText(url):
    # Fetch a page and return its raw HTML
    request  = urllib2.Request(url)
    request.add_header('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp")
    # Ask for an uncompressed body: urllib2 does not decompress gzip responses itself
    request.add_header('Accept-Encoding', "identity")
    request.add_header('User-Agent', UserAgent)
    return urllib2.urlopen(request).read()

def ArrToStr(varArr):
    # Concatenate the string form of every element (tags or text nodes)
    return "".join(str(s) for s in varArr)

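def GetHtmlFind(htmltext, findRule):
    # Apply each rule in order; the result of one step is re-parsed as the input of the next
    findReturn = BeautifulSoup(htmltext, "html.parser")
    for index, f in enumerate(findRule):
        if index > 0:
            findReturn = BeautifulSoup(ArrToStr(findReturn), "html.parser")
        if f['findMode'] == 'find':
            findReturn = findReturn.find(f['findTag'], f['rule'])
        elif f['findMode'] == 'findAll':
            findReturn = findReturn.findAll(f['findTag'], f['rule'])
        if findReturn is None:
            # find() matched nothing, so the remaining rules have nothing to search
            return []
    return findReturn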
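
def GetCategory():
    # Collect the name and URL of every category link on the home page
    categorys = []
    htmltext = GetHtmlText(TargetHost)
    findReturn = GetHtmlFind(htmltext, CategoryFind)
    for tag in findReturn:
        # Skip links with nested markup (tag.string is None) or without an href
        if tag.string is None or not tag.get('href'):
            continue
        print "[G]->Category:" + tag.string + "|Url:" + tag['href']
        categorys.append({'name': tag.string, 'url': tag['href']})
    return categorys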

def GetArticleList(categoryUrl):
    # Walk the category's pages until the stop marker appears and collect article links
    articles = []
    page = PageStart
    while True:
        htmltext = ""
        pageUrl  = PageUrl.replace("#page", str(page))
        print "[G]->PageUrl:" + categoryUrl + pageUrl
        while True:
            try:
                htmltext = GetHtmlText(categoryUrl + pageUrl)
                break
            except urllib2.HTTPError as e:
                print "[E]->HTTP Error:" + str(e.code)
                if e.code == 504:
                    # Gateway timeout: wait, then retry the same page
                    print "[E]->HTTP Error 504: Gateway Time-out, Wait"
                    time.sleep(5)
                else:
                    # 404 marks the last page; any other HTTP error also ends paging
                    htmltext = PageStopHtml
                    break
        if PageStopHtml in htmltext:
            print "End Page."
            break
        else:
            findReturn = GetHtmlFind(htmltext, ArticleListFind)
            for tag in findReturn:
                # Keep only links with plain-text titles that point to the target site
                if tag.string is not None and TargetHost in tag.get('href', ''):
                    print "[G]->Article:" + tag.string + "|Url:" + tag['href']
                    articles.append({'name': tag.string, 'url': tag['href']})
            page += PageStep
    return articles

print "[G]->GetCategory"
Mycategorys = GetCategory()
print "[G]->GetCategory->Success."
time.sleep(3)
for category in Mycategorys:
    print "[G]->GetArticleList:" + category['name']
    GetArticleList(category['url'])
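
The script above targets Python 2, where `urllib2` is available. As a rough sketch only, the fetch, pagination, and retry steps could look like this in Python 3, assuming `urllib.request` stands in for `urllib2`. The names `build_request`, `fetch`, `page_url`, and `fetch_page`, the shortened User-Agent string, and the `max_retries` limit are illustrative assumptions and are not part of the original script.

```python
import time
import urllib.error
import urllib.request

# Shortened placeholder; the original script sends a full Chrome User-Agent string.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"
# Same template as PageUrl in the script: "#page" is replaced by the page number.
PAGE_URL = "page/#page/"


def build_request(url):
    """Build a request carrying headers similar to those set in GetHtmlText."""
    request = urllib.request.Request(url)
    request.add_header("Accept", "text/html,application/xhtml+xml")
    request.add_header("User-Agent", USER_AGENT)
    return request


def fetch(url):
    """Download a page (network call); returns the raw response bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


def page_url(category_url, page):
    """Expand the pagination template for one page of a category."""
    return category_url + PAGE_URL.replace("#page", str(page))


def fetch_page(url, fetcher=fetch, max_retries=3, wait=5):
    """Return the page body, or None when paging should stop.

    A 504 response is retried after `wait` seconds, up to `max_retries`
    attempts (an assumed limit; the original retries indefinitely).
    A 404 or any other HTTP error ends paging.
    """
    for _ in range(max_retries):
        try:
            return fetcher(url)
        except urllib.error.HTTPError as error:
            if error.code == 504:
                time.sleep(wait)
                continue
            return None
    return None
```

Passing the downloader in as `fetcher` keeps the retry logic testable without network access. A crawl loop would call `fetch_page(page_url(category_url, page))` with increasing page numbers and stop at the first `None`.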