使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解


Posted in Python onJanuary 25, 2020

下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最基础的内容

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

一、子节点

一个Tag可能包含多个字符串或者其他Tag,这些都是这个Tag的子节点.BeautifulSoup提供了许多操作和遍历子结点的属性。

1.通过Tag的名字来获得Tag

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

通过名字的方法只能获得第一个Tag,如果要获得所有的某种Tag可以使用find_all方法

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

2.contents属性:将Tag的子节点通过列表的方式返回

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3.children:通过该属性对子节点进行循环

for child in title_tag.children:
  print(child)
The Dormouse's story

4.descendants: 不论是contents还是children都是返回直接子节点,而descendants对所有tag的子孙节点进行递归循环

for child in head_tag.children:
  print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story

5.string 如果tag只有一个NavigableString类型的子节点,那么tag可以使用.string得到该子节点

title_tag.string
"The Dormouse's story"

如果一个tag只有一个子节点,那么使用.string可以获得其唯一子结点的NavigableString.

head_tag.string
"The Dormouse's story"

如果tag有多个子节点,tag无法确定.string对应的是那个子结点的内容,故返回None

print(soup.html.string)
None

6.strings和stripped_strings

如果tag包含多个字符串,可以使用.strings循环获取

for string in soup.strings:
  print(string)
The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

.string输出的内容包含了许多空格和空行,使用strpped_strings去除这些空白内容

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

二、父节点

1.parent:获得某个元素的父节点

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

字符串也有父节点

title_tag.string.parent
<title>The Dormouse's story</title>

2.parents:递归的获得所有父辈节点

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]

三、兄弟结点

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(sibling_soup.prettify())
<html>
 <body>
 <a>
  <b>
  text1
  </b>
  <c>
  text2
  </c>
 </a>
 </body>
</html>

1.next_sibling和previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

在实际文档中.next_sibling和previous_sibling通常是字符串或者空白符

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
soup.a.next_sibling # 第一个<a></a>的next_sibling是,\n
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>

2.next_siblings和previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

四、回退与前进

1.next_element和previous_element

指向下一个或者前一个被解析的对象(字符串或tag),即深度优先遍历的后序节点和前序节点

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'

2.next_elements和previous_elements

通过.next_elements和previous_elements可以向前或向后访问文档的解析内容,就好像文档正在被解析一样

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

更多关于使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的方法与文章大家可以点击下面的相关文章

Python 相关文章推荐
使用Python进行新浪微博的mid和url互相转换实例(10进制和62进制互算)
Apr 25 Python
跟老齐学Python之大话题小函数(2)
Oct 10 Python
MySQL中表的复制以及大型数据表的备份教程
Nov 25 Python
Python解惑之整数比较详解
Apr 24 Python
浅谈python和C语言混编的几种方式(推荐)
Sep 27 Python
使用Python+Splinter自动刷新抢12306火车票
Jan 03 Python
使用Python 统计高频字数的方法
Jan 31 Python
python sorted函数的小练习及解答
Sep 18 Python
python 实现保存最新的三份文件,其余的都删掉
Dec 22 Python
python GUI库图形界面开发之PyQt5动态加载QSS样式文件
Feb 25 Python
python小程序之4名牌手洗牌发牌问题解析
May 15 Python
Python grpc超时机制代码示例
Sep 14 Python
Python爬虫库BeautifulSoup获取对象(标签)名,属性,内容,注释
Jan 25 #Python
Python爬虫库BeautifulSoup的介绍与简单使用实例
Jan 25 #Python
使用Python爬虫库requests发送表单数据和JSON数据
Jan 25 #Python
Python爬虫库requests获取响应内容、响应状态码、响应头
Jan 25 #Python
使用Python爬虫库requests发送请求、传递URL参数、定制headers
Jan 25 #Python
flask框架自定义url转换器操作详解
Jan 25 #Python
常用python爬虫库介绍与简要说明
Jan 25 #Python
You might like
yii实现图片上传及缩略图生成的方法
2014/12/04 PHP
mac下多个php版本快速切换的方法
2016/10/09 PHP
日期函数扩展类Ver0.1.1
2006/09/07 Javascript
理解JavaScript变量作用域更轻松
2009/10/25 Javascript
Javascript解决常见浏览器兼容问题的12种方法
2010/01/04 Javascript
JQuery之拖拽插件实现代码
2011/04/14 Javascript
JQuery+Ajax无刷新分页的实例代码
2014/02/08 Javascript
jquery日历控件实现方法分享
2014/03/07 Javascript
jquery获得同源iframe内body下标签的值的方法
2014/09/25 Javascript
js实现tab切换效果实例
2015/09/16 Javascript
JavaScript通过代码调用Flash显示的方法
2016/02/02 Javascript
Bootstrap被封装的弹层
2016/07/20 Javascript
微信小程序 小程序制作及动画(animation样式)详解
2017/01/06 Javascript
js 实现省市区三级联动菜单效果
2017/02/20 Javascript
Extjs 中的 Treepanel 实现菜单级联选中效果及实例代码
2017/08/22 Javascript
解决vue-router中的query动态传参问题
2018/03/20 Javascript
vue监听对象及对象属性问题
2018/08/20 Javascript
vue+elementUI动态生成面包屑导航教程
2019/11/04 Javascript
Vue 数组和对象更新,但是页面没有刷新的解决方式
2019/11/09 Javascript
Websocket 向指定用户发消息的方法
2020/01/09 Javascript
[03:37]2014DOTA2国际邀请赛 主赛事第一日胜者组TOPPLAY
2014/07/19 DOTA
python中 ? : 三元表达式的使用介绍
2013/10/09 Python
python实现数独算法实例
2015/06/09 Python
使用Python的toolz库开始函数式编程的方法
2018/11/15 Python
Python设计模式之观察者模式原理与用法详解
2019/01/16 Python
python中bs4.BeautifulSoup的基本用法
2019/07/27 Python
python安装后的目录在哪里
2020/06/21 Python
pandas.DataFrame.drop_duplicates 用法介绍
2020/07/06 Python
百思买加拿大:Best Buy Canada
2018/03/20 全球购物
趣天网日本站:Qoo10 JP
2019/09/18 全球购物
Timberland法国官网:购买靴子、鞋子、衣服、夹克和配饰
2019/11/30 全球购物
可以在一个PHP文件里面include另外一个PHP文件两次吗
2015/05/22 面试题
一套带答案的C++笔试题
2014/01/10 面试题
读书伴我成长演讲稿
2014/05/07 职场文书
英文诗歌翻译方法(赏析)
2019/08/16 职场文书
深入理解mysql事务隔离级别和存储引擎
2022/04/12 MySQL