Recursive Crawling with the Python Scraping Package BeautifulSoup: A Worked Example

Posted in Python on January 28, 2017

Overview:

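The main job of a web crawler is to follow links across the web and collect the content we need. At its core, crawling is a recursive process: fetch a page, parse its content to find another URL, fetch the page at that URL, and repeat.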

Let's take Wikipedia as an example.

We want to extract every link in the Wikipedia article on Kevin Bacon that points to another article.

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-25 10:35:00
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib.request import urlopen  # Python 3; the Python 2 module was urllib2
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

# Print the href of every <a> tag that has one
for link in bsObj.findAll("a"):
  if 'href' in link.attrs:
    print(link.attrs['href'])

The code above extracts every hyperlink on the page:

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick

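First, the extracted URLs may contain duplicates.

Second, some of the URLs are ones we don't need, such as links in the sidebar, header, footer, and table of contents.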

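Looking at the page, we can see that every link pointing to an article page shares three characteristics:

They are inside the div tag whose id is bodyContent
The URL does not contain a colon
The URL is a relative path beginning with /wiki/ (the crawler will also come across full absolute paths beginning with http)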
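The second and third rules, together with the duplicate problem, can be tried out without touching the network. The sketch below applies the same regular expression that the crawler further down passes to findAll to a few hrefs from the sample output above. The function name article_links and the repeated /wiki/Philadelphia entry are assumptions added for illustration.

```python
import re

# Same pattern as the crawler's findAll filter: the href must start with
# /wiki/ and contain no colon anywhere after that prefix.
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

# Sample hrefs from the output listed above, plus one repeat (assumed).
hrefs = [
    "/wiki/Wikipedia:Protection_policy#semi",
    "#mw-head",
    "/wiki/Kevin_Bacon_(disambiguation)",
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",
    "/wiki/Philadelphia",
    "/wiki/Kyra_Sedgwick",
    "/wiki/Philadelphia",
]

def article_links(candidates):
    """Return matching hrefs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for href in candidates:
        if ARTICLE_LINK.match(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result

print(article_links(hrefs))
```

Note that the disambiguation link still passes the filter, since it starts with /wiki/ and has no colon. The first rule (being inside the bodyContent div) is not covered here because it depends on the page structure rather than on the URL text.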

from urllib.request import urlopen
from bs4 import BeautifulSoup
import random
import re

pages = set()  # article paths already visited
random.seed()  # seed from the system's entropy source or the current time

def getLinks(articleUrl):
  # Fetch an article and return the <a> tags that point to other articles
  html = urlopen("http://en.wikipedia.org"+articleUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
  # Consider only unvisited links, so the loop ends once none remain
  unvisited = [link.attrs["href"] for link in links if link.attrs["href"] not in pages]
  if not unvisited:
    break
  newArticle = random.choice(unvisited)
  print(newArticle)
  pages.add(newArticle)
  links = getLinks(newArticle)

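Here getLinks takes a path of the form /wiki/<article name> and joins it with Wikipedia's base URL to get the page URL. It uses a regular expression to capture every URL that points to another article and returns the matches to the main program.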
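Plain string concatenation works as long as every href is a relative path. Since, as noted earlier, a crawl can also turn up absolute URLs beginning with http, one hedged alternative is urllib.parse.urljoin from the standard library, which handles both cases. The names BASE_URL and to_absolute are hypothetical and not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = "http://en.wikipedia.org"  # the same base URL getLinks prepends

def to_absolute(href):
    """Resolve href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)

print(to_absolute("/wiki/Kevin_Bacon"))
print(to_absolute("http://example.com/page"))
```

With concatenation, the second href would have produced a malformed URL; with urljoin it is returned as-is.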

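The main loop calls getLinks repeatedly, each time randomly following a URL that has not been visited yet, until no articles remain or the program is stopped manually.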

The code below can crawl the whole of Wikipedia:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # every /wiki/ path seen so far
def getLinks(pageUrl):
  global pages
  html = urlopen("http://en.wikipedia.org"+pageUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  try:
    # Print the title, the first paragraph, and the edit link
    print(bsObj.h1.get_text())
    print(bsObj.find(id="mw-content-text").findAll("p")[0])
    print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
  except (AttributeError, IndexError):
    # find() returned None, or the page has no <p> tag at all
    print("This page is missing something! No worries though!")

  for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
    if 'href' in link.attrs:
      if link.attrs['href'] not in pages:
        # We have encountered a new page
        newPage = link.attrs['href']
        print("----------------\n"+newPage)
        pages.add(newPage)
        getLinks(newPage)  # recurse into the new page
getLinks("")  # start from the site's home page

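Python's default recursion limit is typically 1000 nested calls, so you need to raise the limit manually (for example with sys.setrecursionlimit), or use some other technique so that the code keeps running once it goes more than 1000 calls deep.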
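One such technique is to replace the recursion with an explicit stack, so the depth of the crawl is limited by memory rather than by the interpreter's call stack. The sketch below is a minimal illustration under stated assumptions: FAKE_SITE is a made-up in-memory link graph, and fake_get_links is a hypothetical stand-in for fetching a page and extracting its links, so the example runs without any network access.

```python
# A made-up link graph standing in for real pages: path -> paths it links to.
FAKE_SITE = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}

def fake_get_links(page):
    """Hypothetical stand-in for fetching a page and extracting its links."""
    return FAKE_SITE.get(page, [])

def crawl(start, get_links):
    """Visit every page reachable from start without using recursion."""
    visited = set()
    order = []
    stack = [start]  # pages waiting to be visited
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push in reverse so links are visited in the order they appear.
        for link in reversed(get_links(page)):
            if link not in visited:
                stack.append(link)
    return order

print(crawl("/wiki/A", fake_get_links))
```

Because the pending pages live in an ordinary list, a chain of links thousands of pages deep is handled the same way as a shallow one. In a real crawler, get_links would wrap the urlopen and BeautifulSoup calls shown earlier.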

Thanks for reading. I hope this helps, and thank you for supporting this site!

- Author -

lqh
