
A Detailed Python Web Crawler Example

Posted in Python on June 19, 2018

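This post walks through a working Python web crawler. It covers the crawler's overall architecture and the key modules it is built from: the URL manager, the HTML downloader, and the HTML parser.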

A simple crawler architecture

[Figure: crawler architecture diagram]
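
The flow between the modules can be sketched in a few lines before looking at the real code. This is only a minimal, offline illustration: the `PAGES` dictionary and the `crawl` function are made up for this sketch and stand in for the downloader and parser described in the sections that follow.

```python
# Hypothetical in-memory "web": URL -> (title, links). Not real data.
PAGES = {
    'a': ('Page A', ['b', 'c']),
    'b': ('Page B', ['a', 'c']),
    'c': ('Page C', []),
}


def crawl(root_url, limit=10):
    """Visit pages starting at root_url; return the titles in visit order."""
    new_urls = [root_url]   # URL manager: URLs waiting to be crawled
    old_urls = set()        # URL manager: URLs already crawled
    collected = []          # outputer: data gathered so far

    while new_urls and len(collected) < limit:
        url = new_urls.pop(0)
        old_urls.add(url)
        title, links = PAGES[url]          # downloader + parser stand-in
        collected.append(title)
        for link in links:                 # feed new URLs back to the manager
            if link not in old_urls and link not in new_urls:
                new_urls.append(link)
    return collected


print(crawl('a'))
```

Each page is visited once even though pages A and B link to each other, which is the job of the URL manager.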

Program entry point (crawler scheduler)

# coding: utf-8
import datetime

from maya_Spider import url_manager, html_downloader, html_parser, html_outputer


class Spider_Main(object):
    # Set up the four components
    def __init__(self):
        # URL manager
        self.urls = url_manager.UrlManager()
        # HTML downloader
        self.downloader = html_downloader.HtmlDownloader()
        # HTML parser
        self.parser = html_parser.HtmlParser()
        # HTML outputer
        self.outputer = html_outputer.HtmlOutputer()

    # Crawler scheduling loop
    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_content = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)

                # Stop after 10 successfully crawled pages
                if count == 10:
                    break

                count = count + 1
            except Exception as e:
                # Report the failure and move on to the next URL
                print('craw failed: %s' % e)

        self.outputer.output_html()

if __name__ == '__main__':
    # Crawler entry URL
    root_url = 'http://baike.baidu.com/view/21087.htm'
    # Start time
    print('Timer started..............')
    start_time = datetime.datetime.now()
    obj_spider = Spider_Main()
    obj_spider.craw(root_url)
    # End time
    end_time = datetime.datetime.now()
    print('Total time: %ds' % (end_time - start_time).total_seconds())
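
One detail worth knowing when timing a run with `datetime`: `timedelta.seconds` holds only the seconds component (0 to 86399) and drops whole days, while `total_seconds()` returns the full duration. A small sketch with a made-up duration:

```python
import datetime

# Made-up duration for illustration: 1 day and 5 seconds.
elapsed = datetime.timedelta(days=1, seconds=5)

print(elapsed.seconds)          # seconds component only
print(elapsed.total_seconds())  # full duration in seconds
```

For a ten-page crawl the two agree to the whole second; the difference only shows up on runs longer than a day.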

URL manager

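class UrlManager(object):
    def __init__(self):
        # URLs waiting to be crawled
        self.new_urls = set()
        # URLs that have already been crawled
        self.old_urls = set()

    # Add one URL unless it is already pending or already crawled
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Add a batch of URLs
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # True while there are URLs left to crawl
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Take one pending URL (set.pop() picks an arbitrary one) and mark it as crawled
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url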
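
`set.pop()` removes an arbitrary element, so the class above does not define the order in which pages are visited. If a predictable, first-in-first-out order is wanted, one possible variation keeps a `collections.deque` next to a set of seen URLs. `FifoUrlManager` is a name invented for this sketch and is not part of the original code.

```python
from collections import deque


class FifoUrlManager(object):
    """Sketch of a URL manager that hands URLs out in insertion order."""

    def __init__(self):
        self.queue = deque()   # URLs waiting to be crawled, oldest first
        self.seen = set()      # every URL ever added (pending or crawled)

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) != 0

    def get_new_url(self):
        return self.queue.popleft()


manager = FifoUrlManager()
manager.add_new_urls(['u1', 'u2', 'u1'])   # the duplicate 'u1' is ignored
first = manager.get_new_url()
manager.add_new_url('u1')                  # already crawled, ignored again
second = manager.get_new_url()
print(first, second, manager.has_new_url())
```

The method names match `UrlManager`, so the scheduler could use either class without changes.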

Web page downloader

import urllib.request

class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None

        # Present a browser User-Agent; some sites (the author names CSDN) reject requests without one
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        # Build the request
        req = urllib.request.Request(url, headers=headers)
        # Fetch the page; the with-block closes the connection afterwards
        with urllib.request.urlopen(req) as response:
            # In Python 3, read() returns bytes, not str, so decode it (UTF-8 is what bytes.decode assumes by default)
            return response.read().decode('utf-8')
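
`bytes.decode()` assumes UTF-8 when called without arguments, which fails on pages served in another encoding such as GBK. A possible refinement is to honor the charset declared in the `Content-Type` header, which `urlopen` responses expose through `response.headers.get_content_charset()`. The helper name `decode_body` and the sample bytes are assumptions made for this sketch; it runs without network access.

```python
def decode_body(body, declared_charset=None, fallback='utf-8'):
    """Decode raw response bytes using the declared charset if there is one.

    Undecodable bytes are replaced rather than raising, so one bad page
    does not abort the crawl.
    """
    charset = declared_charset or fallback
    try:
        return body.decode(charset, errors='replace')
    except LookupError:
        # The server declared a charset name Python does not know.
        return body.decode(fallback, errors='replace')


# In the downloader this would be called roughly as:
#   decode_body(response.read(), response.headers.get_content_charset())
sample = 'Python爬虫'.encode('gbk')
print(decode_body(sample, 'gbk'))
print(decode_body(b'plain ascii'))
print(decode_body(b'abc', 'no-such-charset'))
```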

Web page parser

import re
import urllib.parse

from bs4 import BeautifulSoup

class HtmlParser(object):

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Match links whose href contains /item/
        links = soup.find_all('a', href=re.compile(r'/item/'))
        for link in links:
            new_url = link['href']
            # Resolve relative links against the current page URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    # Extract the title and summary
    def _get_new_data(self, page_url, soup):
        # Result dictionary
        res_data = {}
        # url
        res_data['url'] = page_url
        # Title tag: <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
        print(str(title_node.get_text()))
        res_data['title'] = str(title_node.get_text())
        # Summary: <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()

        return res_data

    def parse(self, page_url, html_content):
        # Return an empty pair so the caller can still unpack two values
        if page_url is None or html_content is None:
            return set(), None

        # html_content is already a str, so no from_encoding argument is needed
        soup = BeautifulSoup(html_content, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
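
The link-collecting step can be tried without installing BeautifulSoup. The sketch below swaps it for the standard library's `html.parser.HTMLParser`, so it is a different technique from the code above, shown only to make the `/item/` filter and the `urljoin` step easy to run offline. `LinkCollector` and the sample HTML are made up for the illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains '/item/'."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.pattern = re.compile(r'/item/')
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.pattern.search(href):
            # Resolve relative links against the page they were found on.
            self.urls.add(urljoin(self.page_url, href))


sample_html = '''
<a href="/item/Guido">Guido</a>
<a href="/view/123.htm">old style link</a>
<a href="http://baike.baidu.com/item/CPython">CPython</a>
<a>no href</a>
'''
collector = LinkCollector('http://baike.baidu.com/view/21087.htm')
collector.feed(sample_html)
print(sorted(collector.urls))
```

Only the two `/item/` links survive the filter, and the relative one is resolved against the page URL.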

Web page outputer

import html


class HtmlOutputer(object):

    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # The with-block closes the file even if a write fails
        with open('maya.html', 'w', encoding='utf-8') as fout:
            fout.write('<html>')
            fout.write("<head><meta http-equiv='content-type' content='text/html;charset=utf-8'></head>")
            fout.write('<body>')
            fout.write('<table border="1">')
            # <th width="5%">Url</th>
            fout.write('''<tr style="color:red" width="90%">
     <th>Theme</th>
     <th width="80%">Content</th>
     </tr>''')
            for data in self.datas:
                # Escape scraped text so that characters like < and & cannot break the markup
                url = html.escape(data['url'])
                title = html.escape(data['title'])
                summary = html.escape(data['summary'])
                fout.write('<tr>\n')
                fout.write('\t<td align="center"><a href=\'%s\'>%s</a></td>' % (url, title))
                fout.write('\t<td>%s</td>\n' % summary)
                fout.write('</tr>\n')
            fout.write('</table>')
            fout.write('</body>')
            fout.write('</html>')
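
Text scraped from a page can contain characters such as `<` or `&`, which change meaning once written into a table cell. A minimal sketch of what the standard library's `html.escape` does to such values; the sample strings are made up:

```python
import html

title = 'C++ <templates> & "macros"'
escaped = html.escape(title)   # quote=True by default, so quotes are escaped too
print(escaped)

# Escaped text can be placed in a cell or in an attribute value.
link = html.escape('http://example.com/?a=1&b=2')
row = '<td><a href="%s">%s</a></td>' % (link, escaped)
print(row)
```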

Run result

[Figure: screenshot of the run result]

Appendix: full source code

That is all for this article. I hope it helps with your learning, and I hope you will continue to support 三水点靠木.


- Author -

孙华强
