Python Crawler Package BeautifulSoup: a Worked Example (Part 3)


Posted in Python on June 17, 2018

Build a crawler step by step, scraping jokes from Qiushibaike (糗事百科).

To start, we parse the page without the BeautifulSoup package.

Step 1: request the URL and fetch the page source

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  print content.decode('utf-8')
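The script above targets Python 2 (urllib2, print statements). As a side note, urllib2's functionality now lives in urllib.request and urllib.error; a rough Python 3 sketch of the same fetch step might look like this (the fetch helper name is my own):

```python
# Python 3 sketch of the fetch step above; urllib2 was split into
# urllib.request and urllib.error in Python 3.
from urllib import request
from urllib.error import HTTPError, URLError

def fetch(url, user_agent='Mozilla/5.0'):
    req = request.Request(url, headers={'User-Agent': user_agent})
    try:
        with request.urlopen(req) as resp:
            # Decode the raw bytes; the site serves UTF-8.
            return resp.read().decode('utf-8')
    except (HTTPError, URLError) as e:
        print(e)
        return None
```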

Step 2: extract the content with regular expressions

First, examine the page source to see where the content you need sits and how to identify it.
Then read it out with a regular expression.
Note that . in a regular expression does not match \n by default, so you need to set the matching mode (the re.S flag).
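A quick sketch of the difference re.S makes (the sample HTML below is made up for illustration):

```python
import re

# Made-up fragment mimicking the page structure targeted below.
html = '<div class="content">\n<span>first line\nsecond line</span>\n</div>'

# Without re.S, '.' stops at newlines and the pattern fails to match.
plain = re.findall('<span>(.*?)</span>', html)
# With re.S (a.k.a. re.DOTALL), '.' also matches '\n'.
dotall = re.findall('<span>(.*?)</span>', html, re.S)

print(plain)
print(dotall)
```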

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Extract the data
  # Mind the newlines: re.S lets . match \n as well
  for item in items:
    print item

Step 3: clean the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:41:32

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Extract the data
  # Mind the newlines: re.S lets . match \n as well
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  count = 1
  for item in items:
    # Clean the data: drop \n, then turn <br/> into \n
    item = item.replace('\n', '').replace('<br/>', '\n')
    filepath = path + '/' + str(count) + '.txt'
    f = open(filepath, 'w')
    f.write(item)
    f.close()
    count += 1
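The order of the two replace() calls in the cleanup matters: stray literal newlines are dropped first, and only then are <br/> tags turned into real newlines. A tiny sketch with a made-up sample string:

```python
# Made-up sample mimicking a scraped snippet: stray '\n' plus <br/> tags.
item = 'line one<br/>line two\n<br/>line three'
# Drop literal newlines first, then convert <br/> into real newlines.
cleaned = item.replace('\n', '').replace('<br/>', '\n')
print(cleaned)
```

Swapping the two calls would destroy the line breaks that the <br/> tags were supposed to produce.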

Step 4: scrape the content from multiple pages

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Prepare the output directory, then fetch each page
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  count = 1
  for cnt in range(1, 35):
    print 'Round ' + str(cnt)
    url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
    try:
      request = urllib2.Request(url = url, headers = headers)
      response = urllib2.urlopen(request)
      content = response.read()
    except urllib2.HTTPError as e:
      print e
      exit()
    except urllib2.URLError as e:
      print e
      exit()
    # print content

    # Extract the data
    # Mind the newlines: re.S lets . match \n as well
    items = re.findall(regex, content)

    # Save the items
    for item in items:
      #  print item
      # Clean the data: drop \n, then turn <br/> into \n
      item = item.replace('\n', '').replace('<br/>', '\n')
      filepath = path + '/' + str(count) + '.txt'
      f = open(filepath, 'w')
      f.write(item)
      f.close()
      count += 1

  print 'Done'
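The per-page URL in the loop above is built by plain string concatenation. Factoring it into a small helper makes the pattern easier to see (the helper name and default arguments are my own sketch):

```python
def page_url(page, base='http://www.qiushibaike.com/textnew/page/', s='4941357'):
    # Mirrors the concatenation used in the scraping loop.
    return base + str(page) + '/?s=' + s

# The loop walks pages 1..34; here we just build the first few URLs.
urls = [page_url(p) for p in range(1, 4)]
print(urls[0])
```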

Parsing the source with BeautifulSoup

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:34:02

import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  request = urllib2.Request(url = url, headers = headers)
  response = urllib2.urlopen(request)
  # print response.read()
  soup = BeautifulSoup(response, 'lxml')
  items = soup.find_all("div", class_="content")

  for item in items:
    try:
      content = item.span.string
    except AttributeError as e:
      print e
      exit()

    if content:
      print content + "\n"
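The same find_all / class_ pattern can be tried offline on a small hand-written HTML snippet; html.parser is used here so the sketch does not require lxml to be installed:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the scraped page structure.
html = '''
<div class="content"><span>first joke</span></div>
<div class="content"><span>second joke</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# class_ (with the trailing underscore) filters on the CSS class attribute.
texts = [div.span.string for div in soup.find_all('div', class_='content')]
print(texts)
```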

Below is code that uses BeautifulSoup to scrape book titles and their prices.
Comparing it with the example above shows how bs4 reads tags and tag contents.
(I have not studied this part myself yet, so for now I am just following the pattern.)

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 20:37:38
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re 

from bs4 import BeautifulSoup 


url = "https://www.packtpub.com/all"
try:
  html = urllib2.urlopen(url) 
except urllib2.HTTPError as e:
  print e
  exit()

soup_packtpage = BeautifulSoup(html, 'lxml') 
all_book_title = soup_packtpage.find_all("div", class_="book-block-title") 

price_regexp = re.compile(u"\s+\$\s\d+\.\d+") 

for book_title in all_book_title: 
  try:
    print "Book's name is " + book_title.string.strip()
  except AttributeError as e:
    print e
    exit()
  book_price = book_title.find_next(text=price_regexp) 
  try:
    print "Book's price is "+ book_price.strip()
  except AttributeError as e:
    print e
    exit()
  print ""
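The find_next call with a compiled pattern can also be tried offline on a hand-written fragment; note that string= is the current name of the text= argument used above (the sample HTML and values are made up):

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for a book listing fragment.
html = '''
<div class="book-block-title">Learning Python</div>
<div class="price"> $ 39.99 </div>
'''
soup = BeautifulSoup(html, 'html.parser')
price_regexp = re.compile(r'\s+\$\s\d+\.\d+')
title = soup.find('div', class_='book-block-title')
# find_next scans forward for the next text node matching the pattern;
# string= is the modern spelling of the text= argument.
price = title.find_next(string=price_regexp)
print(price.strip())
```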

That is all for this article. I hope it helps with your study, and that you will keep supporting 三水点靠木.
