使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解


Posted in Python onJanuary 25, 2020

下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最基础的内容

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

一、子节点

一个Tag可能包含多个字符串或者其他Tag,这些都是这个Tag的子节点.BeautifulSoup提供了许多操作和遍历子结点的属性。

1.通过Tag的名字来获得Tag

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

通过名字的方法只能获得第一个Tag,如果要获得所有的某种Tag可以使用find_all方法

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

2.contents属性:将Tag的子节点通过列表的方式返回

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3.children:通过该属性对子节点进行循环

for child in title_tag.children:
  print(child)
The Dormouse's story

4.descendants: 不论是contents还是children都是返回直接子节点,而descendants对所有tag的子孙节点进行递归循环

for child in head_tag.children:
  print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story

5.string 如果tag只有一个NavigableString类型的子节点,那么tag可以使用.string得到该子节点

title_tag.string
"The Dormouse's story"

如果一个tag只有一个子节点,那么使用.string可以获得其唯一子结点的NavigableString.

head_tag.string
"The Dormouse's story"

如果tag有多个子节点,tag无法确定.string对应的是那个子结点的内容,故返回None

print(soup.html.string)
None

6.strings和stripped_strings

如果tag包含多个字符串,可以使用.strings循环获取

for string in soup.strings:
  print(string)
The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

.string输出的内容包含了许多空格和空行,使用strpped_strings去除这些空白内容

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

二、父节点

1.parent:获得某个元素的父节点

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

字符串也有父节点

title_tag.string.parent
<title>The Dormouse's story</title>

2.parents:递归的获得所有父辈节点

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]

三、兄弟结点

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(sibling_soup.prettify())
<html>
 <body>
 <a>
  <b>
  text1
  </b>
  <c>
  text2
  </c>
 </a>
 </body>
</html>

1.next_sibling和previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

在实际文档中.next_sibling和previous_sibling通常是字符串或者空白符

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
soup.a.next_sibling # 第一个<a></a>的next_sibling是,\n
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>

2.next_siblings和previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

四、回退与前进

1.next_element和previous_element

指向下一个或者前一个被解析的对象(字符串或tag),即深度优先遍历的后序节点和前序节点

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'

2.next_elements和previous_elements

通过.next_elements和previous_elements可以向前或向后访问文档的解析内容,就好像文档正在被解析一样

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

更多关于使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的方法与文章大家可以点击下面的相关文章

Python 相关文章推荐
Cython 三分钟入门教程
Sep 17 Python
python文件的md5加密方法
Apr 06 Python
Python实现简单的多任务mysql转xml的方法
Feb 08 Python
Python实现的概率分布运算操作示例
Aug 14 Python
python3基于OpenCV实现证件照背景替换
Jul 18 Python
python进行文件对比的方法
Dec 24 Python
python paramiko利用sftp上传目录到远程的实例
Jan 03 Python
python 批量解压压缩文件的实例代码
Jun 27 Python
Django Serializer HiddenField隐藏字段实例
Mar 31 Python
Python如何向SQLServer存储二进制图片
Jun 08 Python
详解Python中的GIL(全局解释器锁)详解及解决GIL的几种方案
Jan 29 Python
python反编译教程之2048小游戏实例
Mar 03 Python
Python爬虫库BeautifulSoup获取对象(标签)名,属性,内容,注释
Jan 25 #Python
Python爬虫库BeautifulSoup的介绍与简单使用实例
Jan 25 #Python
使用Python爬虫库requests发送表单数据和JSON数据
Jan 25 #Python
Python爬虫库requests获取响应内容、响应状态码、响应头
Jan 25 #Python
使用Python爬虫库requests发送请求、传递URL参数、定制headers
Jan 25 #Python
flask框架自定义url转换器操作详解
Jan 25 #Python
常用python爬虫库介绍与简要说明
Jan 25 #Python
You might like
在windows iis5下安装php4.0+mysql之我见
2006/10/09 PHP
第二章 PHP入门基础之php代码写法
2011/12/30 PHP
php 表单提交大量数据发生丢失的解决方法
2014/03/03 PHP
thinkphp常见路径用法分析
2014/12/02 PHP
Yii2中datetime类的使用
2016/12/17 PHP
jquery绑定原理 简单解析与实现代码分享
2011/09/06 Javascript
JSON.stringify转换JSON时日期时间不准确的解决方法
2014/08/08 Javascript
JavaScript函数节流概念与用法实例详解
2016/06/20 Javascript
详解Angular 自定义结构指令
2017/06/21 Javascript
使用bootstraptable插件实现表格记录的查询、分页、排序操作
2017/08/06 Javascript
Node层模拟实现multipart表单的文件上传示例
2018/01/02 Javascript
layui-table表复选框勾选的所有行数据获取的例子
2019/09/13 Javascript
python抓取网页中的图片示例
2014/02/28 Python
Python库urllib与urllib2主要区别分析
2014/07/13 Python
在Python中实现贪婪排名算法的教程
2015/04/17 Python
Python中的FTP通信模块ftplib的用法整理
2016/07/08 Python
Python中的pygal安装和绘制直方图代码分享
2017/12/08 Python
python求最大值最小值方法总结
2019/06/25 Python
python分割一个文本为多个文本的方法
2019/07/22 Python
django创建超级用户过程解析
2019/09/18 Python
Python any()函数的使用方法
2019/10/28 Python
pycharm 中mark directory as exclude的用法详解
2020/02/14 Python
Python发送邮件实现基础解析
2020/08/14 Python
Python json解析库jsonpath原理及使用示例
2020/11/25 Python
matplotlib 画动态图以及plt.ion()和plt.ioff()的使用详解
2021/01/05 Python
移动端解决悬浮层(悬浮header、footer)会遮挡住内容的3种方法
2015/03/27 HTML / CSS
Quiksilver美国官网:始于1969年的优质冲浪服和滑雪板外套
2020/04/20 全球购物
积极分子思想汇报
2014/01/04 职场文书
领导干部培训感言
2014/01/23 职场文书
槐乡的孩子教学反思
2014/04/27 职场文书
个人授权委托书范文
2014/09/21 职场文书
离婚协议书范本
2015/01/26 职场文书
2015年大学生村官工作总结
2015/04/21 职场文书
质检员工作总结2015
2015/04/25 职场文书
2015年审计人员工作总结
2015/05/26 职场文书
Python 中 Shutil 模块详情
2021/11/11 Python