
A targeted Python crawler for campus forum posts

Posted in Python on July 23, 2018

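Introduction

I wrote this small crawler mainly to collect internship postings from the campus forum. It relies mainly on the Requests library, with lxml to parse the pages.

Source code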

URLs.py

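Given an initial URL that contains a page-number parameter, this module returns the list of page URLs from the page number found in that URL up to pageNum.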

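import re

def getURLs(url, attr, pageNum=1):
  # Build the list of page URLs from the page number found in url up to pageNum.
  all_links = []
  try:
    now_page_number = int(re.search(attr + r'=(\d+)', url).group(1))
    for i in range(now_page_number, pageNum + 1):
      # The fourth positional argument of re.sub is count, not flags,
      # so re.S must not be passed there.
      new_url = re.sub(attr + r'=\d+', attr + '=%s' % i, url)
      all_links.append(new_url)
    return all_links
  except TypeError:
    print "arguments TypeError:attr should be string."
  except AttributeError:
    # re.search returned None: the parameter is not present in the url
    print "attr not found in url."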
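The same idea can also be sketched without regular expressions. The block below is a minimal Python 3 sketch (the listings in this article are Python 2) that swaps the regex for the standard urllib.parse module, so the query string is parsed rather than pattern-matched. The function name page_urls and its arguments are hypothetical and are not part of the original module.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(url, attr, last_page):
    """Return the URLs from the page number found in `url` up to `last_page`."""
    parts = urlsplit(url)
    # keep_blank_values keeps empty parameters such as "action="
    query = parse_qsl(parts.query, keep_blank_values=True)

    current = None
    for key, value in query:
        if key == attr:
            current = int(value)
            break
    if current is None:
        raise ValueError("parameter %r not found in %r" % (attr, url))

    urls = []
    for page in range(current, last_page + 1):
        # Rewrite only the page parameter and leave the others as they are
        new_query = [(k, str(page) if k == attr else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


if __name__ == '__main__':
    start = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
    for link in page_urls(start, 'page', 3):
        print(link)
```

Unlike the regex version, this raises ValueError when the parameter is missing instead of printing a message, which makes the failure harder to overlook in the calling code.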

uni_2_native.py

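The Chinese text in pages fetched from the forum arrives as numeric character references of the form &#XXXX; (the Unicode code point of each character), so the content has to be converted after it is downloaded.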

import sys
import re
reload(sys)
# Python 2 only workaround so that str and unicode can be mixed as UTF-8
sys.setdefaultencoding('utf-8')

def get_native(raw):
  # Replace every decimal reference &#NNNN; with its character in one pass.
  # Matching digits only keeps int() from failing on other entities,
  # and a replacement function avoids escape problems in the result.
  return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), raw)
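For readers on Python 3, where unichr and sys.setdefaultencoding no longer exist, the conversion can be sketched as follows. This is an illustrative sketch, not the author's code: decode_numeric_refs is a hypothetical name, and it only handles decimal references, as the original function does. The standard library's html.unescape is shown next to it because it covers the same case and additionally decodes hexadecimal and named entities.

```python
import html
import re

_NUMERIC_REF = re.compile(r'&#(\d+);')


def _replace(match):
    code = int(match.group(1))
    # Leave out-of-range references untouched instead of raising ValueError
    return chr(code) if code <= 0x10FFFF else match.group(0)


def decode_numeric_refs(text):
    """Replace decimal references such as &#20013; with their characters."""
    return _NUMERIC_REF.sub(_replace, text)


if __name__ == '__main__':
    sample = '&#20013;&#25991; &amp; more'
    print(decode_numeric_refs(sample))  # decimal references only
    print(html.unescape(sample))        # also decodes &amp; and other entities
```

Which of the two fits better depends on whether named entities such as &amp; should stay escaped before the text is handed to the HTML parser.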

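Saving to the database: saveInfo.py

Note that although the class is named saveSqlite, the code writes to MySQL through MySQLdb.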

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    # The connection settings are placeholders; use your own host, user, password and database.
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    try:
      # Clear the table first so that each run stores a fresh snapshot.
      sql = "delete from info"
      cursor.execute(sql)
      conn.commit()

      sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
      params = []
      for each in self.infoList:
        params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
      cursor.executemany(sql, params)

      conn.commit()
    finally:
      # Release the cursor and the connection even if a statement fails.
      cursor.close()
      conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
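For readers who would rather store the results in SQLite, the following is a minimal sketch using the standard-library sqlite3 module, assuming Python 3. The article does not show the schema of the info table, so the column types here are assumptions, and save_to_sqlite is a hypothetical helper rather than part of saveInfo.py.

```python
import sqlite3


def save_to_sqlite(conn, records):
    """Replace the contents of the info table with `records` and return the row count."""
    # Assumed schema: the original article does not show the table definition.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS info ('
        'title TEXT, author TEXT, url TEXT, "date" TEXT, reply INTEGER, "view" INTEGER)'
    )
    # "with conn" commits on success and rolls back if a statement raises
    with conn:
        conn.execute('DELETE FROM info')
        conn.executemany(
            'INSERT INTO info(title, author, url, "date", reply, "view") '
            'VALUES (:title, :author, :url, :date, :reply, :view)',
            records,
        )
    return conn.execute('SELECT COUNT(*) FROM info').fetchone()[0]


if __name__ == '__main__':
    connection = sqlite3.connect(':memory:')  # use a file path to keep the data
    rows = [{
        'title': 'aaa', 'author': 'someone', 'url': 'www.baidu.com',
        'date': '2008-10-10 10:10:10', 'reply': 1, 'view': 1,
    }]
    print(save_to_sqlite(connection, rows))
    connection.close()
```

Because the delete and the inserts share one transaction here, a failed insert leaves the previous contents of the table in place, which differs from the MySQL listing above where the delete is committed on its own.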

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Build a browser-like set of request headers for the site you want to crawl
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get information from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    hot = each.xpath('./td[4]/text()')
    # Skip rows that lack a topic link, a title attribute or a reply/view cell.
    # Otherwise indexing fails, or values from the previous row get saved again.
    if not href or not title_author_time or not hot:
      continue

    # xpath() always returns a list; each list here holds a single element
    link = cc98 + href[0]

    # The title attribute holds three lines: title, author and date
    info_split = title_author_time[0].split('\n')
    title = info_split[0][1:len(info_split[0])-1]
    author = info_split[1][3:]
    date = info_split[2][3:]

    # The fourth cell has the form reply/view
    reply_view = hot[0].strip().split('/')
    reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
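The field extraction in the inner loop can be isolated into a small function that is easy to test without touching the network. The sketch below assumes Python 3. The slicing offsets (drop the first and last character of the title line, drop a three-character prefix from the author and date lines) are taken from the code above, but the sample strings are hypothetical, since the article does not show what the title attribute actually looks like. parse_topic is an illustrative name.

```python
def parse_topic(title_attr, hot_text):
    """Split a topic link's title attribute and its reply/view cell into fields.

    Returns None when the input does not have the expected shape.
    """
    lines = title_attr.split('\n')
    counts = hot_text.strip().split('/')
    if len(lines) < 3 or len(counts) != 2:
        return None
    if not all(c.isdigit() for c in counts):
        return None
    return {
        'title': lines[0][1:-1],   # strip the surrounding bracket characters
        'author': lines[1][3:],    # drop the three-character label prefix
        'date': lines[2][3:],
        'reply': int(counts[0]),
        'view': int(counts[1]),
    }


if __name__ == '__main__':
    # Hypothetical sample in the assumed format
    sample_title = '《Summer internship》\n作者：alice\n时间：2018-07-01 10:00:00'
    print(parse_topic(sample_title, ' 3/120 '))
    print(parse_topic('not a topic row', ''))
```

Returning None for malformed rows lets the caller skip header or separator rows explicitly instead of relying on empty lists.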

That is all for this article. I hope it helps with your studies.


- Author -

lannooooooooooo
