使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解


Posted in Python onJanuary 25, 2020

下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最基础的内容

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

一、子节点

一个Tag可能包含多个字符串或者其他Tag,这些都是这个Tag的子节点.BeautifulSoup提供了许多操作和遍历子结点的属性。

1.通过Tag的名字来获得Tag

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

通过名字的方法只能获得第一个Tag,如果要获得所有的某种Tag可以使用find_all方法

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

2.contents属性:将Tag的子节点通过列表的方式返回

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3.children:通过该属性对子节点进行循环

for child in title_tag.children:
  print(child)
The Dormouse's story

4.descendants: 不论是contents还是children都是返回直接子节点,而descendants对所有tag的子孙节点进行递归循环

for child in head_tag.children:
  print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story

5.string 如果tag只有一个NavigableString类型的子节点,那么tag可以使用.string得到该子节点

title_tag.string
"The Dormouse's story"

如果一个tag只有一个子节点,那么使用.string可以获得其唯一子结点的NavigableString.

head_tag.string
"The Dormouse's story"

如果tag有多个子节点,tag无法确定.string对应的是那个子结点的内容,故返回None

print(soup.html.string)
None

6.strings和stripped_strings

如果tag包含多个字符串,可以使用.strings循环获取

for string in soup.strings:
  print(string)
The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

.string输出的内容包含了许多空格和空行,使用strpped_strings去除这些空白内容

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

二、父节点

1.parent:获得某个元素的父节点

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

字符串也有父节点

title_tag.string.parent
<title>The Dormouse's story</title>

2.parents:递归的获得所有父辈节点

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]

三、兄弟结点

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(sibling_soup.prettify())
<html>
 <body>
 <a>
  <b>
  text1
  </b>
  <c>
  text2
  </c>
 </a>
 </body>
</html>

1.next_sibling和previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

在实际文档中.next_sibling和previous_sibling通常是字符串或者空白符

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
soup.a.next_sibling # 第一个<a></a>的next_sibling是,\n
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>

2.next_siblings和previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

四、回退与前进

1.next_element和previous_element

指向下一个或者前一个被解析的对象(字符串或tag),即深度优先遍历的后序节点和前序节点

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'

2.next_elements和previous_elements

通过.next_elements和previous_elements可以向前或向后访问文档的解析内容,就好像文档正在被解析一样

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

更多关于使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的方法与文章大家可以点击下面的相关文章

Python 相关文章推荐
python实现保存网页到本地示例
Mar 16 Python
python3编写C/S网络程序实例教程
Aug 25 Python
基于Django filter中用contains和icontains的区别(详解)
Dec 12 Python
Python import与from import使用及区别介绍
Sep 06 Python
Python中的元组介绍
Jan 28 Python
Python数据分析模块pandas用法详解
Sep 04 Python
python基于三阶贝塞尔曲线的数据平滑算法
Dec 27 Python
PyTorch 随机数生成占用 CPU 过高的解决方法
Jan 13 Python
Python3 xml.etree.ElementTree支持的XPath语法详解
Mar 06 Python
用Python将库打包发布到pypi
Apr 13 Python
python异步的ASGI与Fast Api实现
Jul 16 Python
Python matplotlib绘制条形统计图 处理多个实验多组观测值
Apr 21 Python
Python爬虫库BeautifulSoup获取对象(标签)名,属性,内容,注释
Jan 25 #Python
Python爬虫库BeautifulSoup的介绍与简单使用实例
Jan 25 #Python
使用Python爬虫库requests发送表单数据和JSON数据
Jan 25 #Python
Python爬虫库requests获取响应内容、响应状态码、响应头
Jan 25 #Python
使用Python爬虫库requests发送请求、传递URL参数、定制headers
Jan 25 #Python
flask框架自定义url转换器操作详解
Jan 25 #Python
常用python爬虫库介绍与简要说明
Jan 25 #Python
You might like
鸡肋的PHP单例模式应用详解
2013/06/03 PHP
php获取网页请求状态程序示例
2014/06/17 PHP
yii使用activeFileField控件实现上传文件与图片的方法
2015/12/28 PHP
Symfony2中被遗弃的getRequest()方法分析
2016/03/17 PHP
PHP中常用的数组操作方法笔记整理
2016/05/16 PHP
静态页面下用javascript操作ACCESS数据库(读增改删)的代码
2007/05/14 Javascript
jQuery 注意事项 与原因分析
2009/04/24 Javascript
Javascript实现CheckBox的全选与取消全选的代码
2010/07/20 Javascript
基于jquery的横向滚动条(滑动条)
2011/02/24 Javascript
求数组最大最小值方法适用于任何数组
2013/08/16 Javascript
JavaScript截取指定长度字符串点击可以展开全部代码
2015/12/04 Javascript
从重置input file标签中看jQuery的 .val() 和 .attr(“value”) 区别
2016/06/12 Javascript
js 获取元素所有兄弟节点的实现方法
2016/09/06 Javascript
js date 格式化
2017/02/15 Javascript
Vue2.x中的父组件传递数据至子组件的方法
2017/05/01 Javascript
weui框架实现上传、预览和删除图片功能代码
2017/08/24 Javascript
ES6 javascript的异步操作实例详解
2017/10/30 Javascript
Vue官方推荐AJAX组件axios.js使用方法详解与API
2018/10/09 Javascript
[00:23]魔方之谜解锁款式
2018/12/20 DOTA
[36:41]完美世界DOTA2联赛循环赛FTD vs Magma第一场 10月30日
2020/10/31 DOTA
在MAC上搭建python数据分析开发环境
2016/01/26 Python
详解python发送各类邮件的主要方法
2016/12/22 Python
对python pandas读取剪贴板内容的方法详解
2019/01/24 Python
Python3字符串encode与decode的讲解
2019/04/02 Python
Pytorch在NLP中的简单应用详解
2020/01/08 Python
简单了解如何封装自己的Python包
2020/07/08 Python
Dyson加拿大官方网站:购买戴森吸尘器,风扇,冷热器及配件
2016/10/26 全球购物
英国领先的男装设计师服装独立零售商:Repertoire Fashion
2020/10/19 全球购物
通信生自我鉴定
2014/01/18 职场文书
土建专业大学生自荐信范文
2014/04/09 职场文书
党的群众路线教育实践活动查摆问题自查报告
2014/10/10 职场文书
2014年保管员工作总结
2014/11/18 职场文书
单位收入证明范本
2015/06/18 职场文书
护士自荐信范文(2016推荐篇)
2016/01/28 职场文书
解析MySQL索引的作用
2022/03/03 MySQL
SQL语句多表联合查询的方法示例
2022/04/18 MySQL