
A Detailed Guide to Extracting Content Under Specific Tags with BeautifulSoup

Posted in Python on December 07, 2020

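Below are some notes from my own study of BeautifulSoup. My current approach when scraping data is: first use find_all() to locate the tags that contain the content I need; if one find_all() call is not enough, use two or more. Then iterate over the find_all() results and use get_text() or get('href') to obtain the text or the link, appending each to its own list. The drawback is that the text ends up in one list and the links in another: the structure relating the values is lost, and every additional field requires yet another empty list.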

Iterating over all tags:

soup.find_all('a')

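This finds every <a> tag on the page and returns the results as a list. The matched tags can be processed further, for example by looping over them with for and cleaning each one to extract values such as links or text.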

# Create empty lists
links = []
txts = []
tags = soup.find_all('a')
for tag in tags:
  links.append(tag.get('href'))
  txts.append(tag.text)  # or txts.append(tag.get_text())
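The drawback of parallel lists mentioned above can be reduced by storing one record per tag. The following is a minimal sketch of that idea, not part of the original notes; the sample markup and the names html and records are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<p><a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a>No link here</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# One dictionary per <a> tag keeps each text next to its own link.
records = []
for tag in soup.find_all("a"):
    records.append({
        "text": tag.get_text(strip=True),
        "href": tag.get("href"),  # None when the attribute is missing
    })

for record in records:
    print(record)
```

Adding another field then means adding one more key to the dictionary rather than creating another list.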

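Getting an attribute value from a tag:

atr = []
tags = soup.find_all('a')
for tag in tags:
  if tag.p is not None:  # skip <a> tags that have no <p> child
    atr.append(tag.p.get('class'))  # class value of the <p> child under <a>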
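As a self-contained illustration of reading attributes from a child tag, here is a small sketch. The markup is hypothetical. It relies on two behaviours of Beautiful Soup: tag.p returns None when there is no <p> descendant, and class is treated as a multi-valued attribute, so its value comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> contains a <p>, the second does not.
html = (
    '<div><a href="/x"><p class="intro lead">Hi</p></a>'
    '<a href="/y">No paragraph</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

classes = []
for tag in soup.find_all("a"):
    child = tag.p  # first <p> descendant, or None if there is none
    if child is not None:
        classes.append(child.get("class"))

first_href = soup.a["href"]  # indexing raises KeyError if the attribute is absent
print(classes)
print(first_href)
```

Using get() instead of indexing is the safer choice when an attribute may be missing, because it returns None rather than raising.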

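Usage examples for find_all():

The examples come from the Chinese translation of the Beautiful Soup documentation.

1. Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will look for tags whose name matches that string exactly. The following example finds all <b> tags in the document: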

soup.find_all('b')
# [<b>The Dormouse's story</b>]

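2. Regular expressions

If you pass in a regular expression object, Beautiful Soup filters tag names with the expression's search() method. The following example finds all tags whose names start with b, so both the <body> and <b> tags are matched: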

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

The following code finds all tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
  print(tag.name)
# html
# title
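The two regular-expression searches above can be reproduced with a small stand-alone document. This is a sketch with hypothetical markup; it shows that an anchored pattern such as ^b only matches at the start of the tag name, while an unanchored pattern matches anywhere in it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical minimal document for illustration.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Tag names that contain "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```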

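3. Lists

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document: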

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

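4. Functions (a custom function passed to find_all)

If none of the built-in filters fits, you can define a function that takes a single element as its only argument. The function should return True if the element matches and False otherwise.
The following function returns True if the current element has a class attribute but no id attribute: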

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)  # pass the function itself as the filter

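The result contains only the <p> tags and none of the <a> tags, because the <a> tags also define an id attribute. The <html> and <head> tags are not returned either, because they do not define a class attribute.

The following code finds all tags that are surrounded by string objects: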

from bs4 import NavigableString
def surrounded_by_strings(tag):
  return (isinstance(tag.next_element, NavigableString)
      and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
  print(tag.name)
# p
# a
# a
# a
# p
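To make the function-filter behaviour concrete, the sketch below runs has_class_but_no_id against a shortened copy of the "three sisters" sample document. The shortened document is an assumption made for brevity; the filter function is the one shown earlier in this section.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matched = soup.find_all(has_class_but_no_id)
matched_names = [tag.name for tag in matched]
matched_classes = [tag["class"] for tag in matched]
print(matched_names)
print(matched_classes)
```

Only the three <p> tags pass the filter: the <a> tags are rejected because they define id, and the remaining tags are rejected because they have no class.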

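5. Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Since Beautiful Soup 4.1.1, you can search by CSS class with the class_ argument: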

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Or:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
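One detail worth illustrating is that a tag may carry several CSS classes, and a search for a single class name still matches it. The sketch below uses hypothetical markup and compares class_, attrs, and a CSS selector through select(). It assumes a Beautiful Soup installation that includes CSS selector support, which standard installs do.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first tag has two classes.
html = (
    '<a class="sister big" id="link1">Elsie</a>'
    '<a class="brother" id="link2">Tom</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = [tag["id"] for tag in soup.find_all("a", class_="sister")]
by_attrs = [tag["id"] for tag in soup.find_all("a", attrs={"class": "sister"})]
by_selector = [tag["id"] for tag in soup.select("a.sister")]

print(by_keyword, by_attrs, by_selector)
```

All three forms select the same tag here, so the choice between them is mostly a matter of readability.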

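6. Searching with the text argument

The text argument searches the strings in the document rather than the tags. Like the name argument, text accepts a string, a regular expression, a list, a function, or True. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.) For example: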

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
  """Return True if this string is the only child of its parent tag."""
  return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

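Although text is meant for finding strings, it can be combined with arguments that find tags: Beautiful Soup will return the tags whose .string attribute matches the value of text. The following code finds the <a> tags whose .string is "Elsie":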

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
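The same searches can be written with the string argument. The sketch below assumes Beautiful Soup 4.4.0 or later and uses hypothetical markup. It shows the difference between searching for strings alone, which returns string objects, and combining string with a tag name, which returns tags.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<p>Sisters: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Searching strings only: the result is a list of string objects.
strings = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# Combining a tag name with string: the result is a list of tags.
tag_ids = [tag["id"] for tag in soup.find_all("a", string="Elsie")]

print(strings)
print(tag_ids)
```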

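7. Searching only direct children

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to consider the tag's direct children, pass recursive=False.

A simple document: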

<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
...

Search results with and without the recursive argument:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
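The effect of recursive=False can be checked with a runnable sketch. The one-line document below is a hypothetical compact form of the markup shown above. Passing True as the name matches every tag, which makes it easy to list the direct children.

```python
from bs4 import BeautifulSoup

# Compact, hypothetical version of the document above.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# <title> is a grandchild of <html>, so it is found only by a recursive search.
found_recursive = soup.html.find_all("title")
found_direct = soup.html.find_all("title", recursive=False)

# True matches any tag; with recursive=False only direct children are listed.
direct_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(len(found_recursive), len(found_direct), direct_children)
```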


- Author -

qianc6350528
