Python HTML解析器BeautifulSoup用法实例详解【爬虫解析器】


Posted in Python onApril 05, 2019

本文实例讲述了Python HTML解析器BeautifulSoup用法。分享给大家供大家参考,具体如下:

BeautifulSoup简介

我们知道,Python拥有出色的内置HTML解析器模块——HTMLParser,然而还有一个功能更为强大的HTML或XML解析工具——BeautifulSoup(美味的汤),它是一个第三方库。简单来说,BeautifulSoup最主要的功能是从网页抓取数据。本文我们来感受一下BeautifulSoup的优雅而强大的功能吧!

BeautifulSoup安装

BeautifulSoup3 目前已经停止开发,推荐在现在的项目中使用BeautifulSoup4,不过它已经被移植到bs4了,也就是说导入时我们需要 import bs4 。可以利用 pip 或者 easy_install 两种方法来安装。下面采用pip安装。

pip install beautifulsoup4
pip install lxml

建议同时安装"lxml"模块,BeautifulSoup支持Python标准库中的HTML解析器(HTMLParser),还支持一些第三方的解析器,如果我们不安装它,则 Python 会使用 Python默认的解析器,lxml 解析器更加强大,速度更快,推荐安装。

创建对象

安装后,创建对象:

soup = BeautifulSoup(markup='html文件', 'lxml')

格式化输出:

soup.prettify()

BeautifulSoup四大对象类型

BeautifulSoup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:

  • Tag(标签)
  • NavigableString(内容)
  • BeautifulSoup(文档)
  • Comment(注释)

1.Tag类型

即HTML的整个标签,如获取<title>标签:

print soup.title
#<title>The Dormouse's story</title>

Tag有两个重要属性:name,attrs。

name

即HTML的标签名称:

print soup.name
#[document]
print soup.head.name
#head

attrs

即HTML的标签属性字典:

print soup.p.attrs
#{'class': ['title'], 'name': 'dromouse'}

如果想要单独获取某个属性:

print soup.p['class']
#['title']

2.NavigableString类型

既然我们已经得到了整个标签,那么问题来了,我们要想获取标签内部的文字内容怎么办呢?很简单,用 string 即可:

print soup.p.string
#The Dormouse's story

3.BeautifulSoup类型

BeautifulSoup 对象表示的是一个文档的全部内容.:

print soup.name
# [document]

4.Comment类型

HTML的注释内容,注意的是,不包含注释符号。我们首先判断它的类型,是否为 Comment 类型,然后再进行其他操作,如打印输出:

if type(soup.a.string)==bs4.element.Comment:
  print soup.a.string
#<!-- Elsie -->

遍历文档树

1.子节点

contents

获取所有子节点,返回列表:

print soup.head.contents
#[<title>The Dormouse's story</title>]

children

获取所有子节点,返回列表生成器:

print soup.head.children
#<listiterator object at 0x7f71457f5710>
## 需要遍历
for child in soup.body.children:
  print child
## 结果
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

2.节点内容

string

返回单个文本内容。如果一个标签里面没有标签了,那么 string 就会返回标签里面的内容。如果标签里面只有唯一的一个标签了,那么 string 也会返回最里面的内容。如果tag包含了多个子节点,tag就无法确定,string 方法应该调用哪个子节点的内容,string 的输出结果是 None。例如:

print soup.head.string
print soup.title.string
#The Dormouse's story
#The Dormouse's story
print soup.html.string
# None

strings

返回多个文本内容,且包含空行和空格。

stripped_strings

返回多个文本内容,且不包含空行和空格:

for string in soup.stripped_strings:
  print(repr(string))
  # u"The Dormouse's story"
  # u"The Dormouse's story"
  # u'Once upon a time there were three little sisters; and their names were'
  # u'Elsie'
  # u','
  # u'Lacie'
  # u'and'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'...'

get_text()方法

返回当前节点和子节点的文本内容。

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister1" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister2" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister3" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(markup=html_doc,features='lxml')
node_p_text=soup.find('p',class_='story').get_text()
# 注意class_带下划线
print(node_p_text)
# 结果
Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.

3.父节点

parent

返回某节点的直接父节点:

p = soup.p
print p.parent.name
#body

parents

返回某节点的所有父辈及以上辈的节点:

content = soup.head.title.string
for parent in content.parents:
  print parent.name
## 结果
title
head
html
[document]

4.兄弟节点

next_sibling

next_sibling 属性获取该节点的下一个兄弟节点,结果通常是字符串或空白,因为空白或者换行也可以被视作一个节点。

previous_sibling

previous_sibling 属性获取该节点的上一个兄弟节点。

print soup.p.next_sibling
#    实际该处为空白
print soup.p.prev_sibling
#None  没有前一个兄弟节点,返回 None
print soup.p.next_sibling.next_sibling
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#下一个节点的下一个兄弟节点是我们可以看到的节点

next_siblingsprevious_siblings

迭代获取全部兄弟节点。

5.前后节点

next_elementprevious_element

不是针对于兄弟节点,而是在于所有节点,不分层次的前一个和后一个节点。

next_elementsprevious_elements

迭代获取所有前和后节点。

搜索文档树

1.find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

find_all()方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件。

参数说明

name参数

name参数很强大,可以传多种方式的参数,查找所有名字为 name 的tag,字符串对象会被自动忽略掉。

(a)传标签名

最简单的过滤器是标签名。在搜索方法中传入一个标签名参数,BeautifulSoup会查找与标签名完整匹配的内容,下面的例子用于查找文档中所有的<a>标签:

print soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

返回结果列表中的元素仍然是BeautifulSoup对象。

(b)传正则表达式

如果传入正则表达式作为参数,BeautifulSoup会通过正则表达式的 match() 来匹配内容。下面例子中找出所有以b开头的标签,这表示<body>和<b>标签都应该被找到:

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

(c)传列表

如果传入列表参数,BeautifulSoup会将与列表中任一元素匹配的内容返回。下面代码找到文档中所有<a>标签和<b>标签:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

(d)传True

True 可以匹配任何值,下面代码查找到所有的tag,但是不会返回字符串节点:

for tag in soup.find_all(True):
  print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a

(e)传函数

如果没有合适过滤器,那么还可以定义一个方法,方法只接受一个元素参数。如果这个方法返回 True 表示当前元素匹配并且被找到,如果不是则反回 False:

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]

keyword参数

注意的是,如果一个指定名字的参数不是搜索内置的参数名,搜索时会把该参数当作指定名字tag的属性来搜索,如果包含一个名字为 id 的参数,BeautifulSoup会搜索每个tag的”id”属性:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>]

如果传入 href 参数,Beautiful Soup会搜索每个tag的"href"属性:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>]

使用多个指定名字的参数可以同时过滤tag的多个属性:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">three</a>]

在这里我们想用 class 过滤,不过 class 是 python 的关键词,这怎么办?加个下划线就可以:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

attrs参数

有些tag属性在搜索不能使用,比如HTML5中的 " data-* " 自定义属性:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
## 但是可以通过 find_all() 方法的 attrs 参数定义一个字典参数来搜索包含特殊属性的tag
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

text参数

通过 text 参数可以搜搜文档中的字符串内容。与 name 参数的可选值一样,text 参数接受字符串 、正则表达式 、列表、True。

soup.find_all(text="Elsie")
# [u'Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(text=re.compile("Dormouse"))# 模糊查找
[u"The Dormouse's story", u"The Dormouse's story"]

limit参数

find_all() 方法返回全部的搜索结构,如果文档树很大那么搜索会很慢。如果我们不需要全部结果,可以使用 limit 参数限制返回结果的数量。效果与SQL中的limit关键字类似,当搜索到的结果数量达到 limit 的限制时,就停止搜索返回结果。

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>]

recursive参数

调用tag的 find_all() 方法时,BeautifulSoup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []

2.find( name , attrs , recursive , text , **kwargs )

它与 find_all() 方法唯一的区别是 find_all() 方法的返回结果是值包含一个元素的列表,而 find() 方法直接返回结果。

3.find_parents() 和 find_parent()

find_all() 和 find() 只搜索当前节点的所有子节点,孙子节点等。find_parents() 和 find_parent() 用来搜索当前节点的父辈节点,搜索方法与普通tag的搜索方法相同,搜索文档搜索文档包含的内容。

4.find_next_siblings() 和 find_next_sibling()  

这2个方法通过 .next_siblings 属性对当 tag 的所有后面解析的兄弟 tag 节点进行迭代, find_next_siblings() 方法返回所有符合条件的后面的兄弟节点,find_next_sibling() 只返回符合条件的后面的第一个tag节点。

5.find_previous_siblings() 和 find_previous_sibling()

这2个方法通过 .previous_siblings 属性对当前 tag 的前面解析的兄弟 tag 节点进行迭代, find_previous_siblings() 方法返回所有符合条件的前面的兄弟节点,find_previous_sibling() 方法返回第一个符合条件的前面的兄弟节点。

6.find_all_next() 和 find_next()

这2个方法通过 .next_elements 属性对当前 tag 的之后的 tag 和字符串进行迭代, find_all_next() 方法返回所有符合条件的节点, find_next() 方法返回第一个符合条件的节点。

7.find_all_previous() 和 find_previous()

这2个方法通过 .previous_elements 属性对当前节点前面的 tag 和字符串进行迭代,find_all_previous() 方法返回所有符合条件的节点, find_previous()方法返回第一个符合条件的节点。

CSS选择器

我们在写 CSS 时,标签名不加任何修饰,类名前加点,id名前加 #,在这里我们也可以利用类似的方法来筛选元素,用到的方法是 soup.select(),返回类型是 list。

通过标签名查找

print soup.select('title')
#[<title>The Dormouse's story</title>]
print soup.select('a')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
print soup.select('b')
#[<b>The Dormouse's story</b>]

通过类名查找

print soup.select('.sister')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

通过 id 名查找

print soup.select('#link1')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

组合查找

组合查找即和写 class 文件时,标签名与类名、id名进行的组合原理是一样的,例如查找 p 标签中,id 等于 link1的内容,二者需要用空格分开。

print soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

直接子标签查找:

print soup.select("head > title")
#[<title>The Dormouse's story</title>]

属性查找

查找时还可以加入属性元素,属性需要用中括号括起来,注意属性和标签属于同一节点,所以中间不能加空格,否则会无法匹配到。

print soup.select('a[class="sister"]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
print soup.select('a[href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

同样,属性仍然可以与上述查找方式组合,不在同一节点的空格隔开,同一节点的不加空格:

print soup.select('p a[href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

以上的 select 方法返回的结果都是列表形式,可以遍历形式输出,然后用 string或get_text() 方法来获取它的内容:

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()
for title in soup.select('title'):
  print title.get_text()

更多关于Python相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家Python程序设计有所帮助。

Python 相关文章推荐
用Python的urllib库提交WEB表单
Feb 24 Python
Python中使用urllib2模块编写爬虫的简单上手示例
Jan 20 Python
django 2.0更新的10条注意事项总结
Jan 05 Python
基于windows下pip安装python模块时报错总结
Jun 12 Python
python中dict字典的查询键值对 遍历 排序 创建 访问 更新 删除基础操作方法
Sep 13 Python
解决Pyinstaller 打包exe文件 取消dos窗口(黑框框)的问题
Jun 21 Python
pytorch:torch.mm()和torch.matmul()的使用
Dec 27 Python
浅谈Python访问MySQL的正确姿势
Jan 07 Python
Pycharm最常用的快捷键及使用技巧
Mar 05 Python
快速解释如何使用pandas的inplace参数的使用
Jul 23 Python
Python超详细分步解析随机漫步
Mar 17 Python
python神经网络 tf.name_scope 和 tf.variable_scope 的区别
May 04 Python
Python HTML解析模块HTMLParser用法分析【爬虫工具】
Apr 05 #Python
Python爬虫实现爬取百度百科词条功能实例
Apr 05 #Python
Python3.5多进程原理与用法实例分析
Apr 05 #Python
Python选择网卡发包及接收数据包
Apr 04 #Python
详解Python的数据库操作(pymysql)
Apr 04 #Python
python dlib人脸识别代码实例
Apr 04 #Python
python图像处理入门(一)
Apr 04 #Python
You might like
ThinkPHP实现动态包含文件的方法
2014/11/29 PHP
PHP使用memcache缓存技术提高响应速度的方法
2014/12/26 PHP
PHP生成腾讯云COS接口需要的请求签名
2018/05/20 PHP
PHP后台备份MySQL数据库的源码实例
2019/03/18 PHP
ExtJs之带图片的下拉列表框插件
2010/03/04 Javascript
分享一道笔试题[有n个直线最多可以把一个平面分成多少个部分]
2012/10/12 Javascript
缓动函数requestAnimationFrame 更好的实现浏览器经动画
2012/12/07 Javascript
Extjs优化(一)删除冗余代码提高运行速度
2013/04/15 Javascript
jquery动态调整div大小使其宽度始终为浏览器宽度
2014/06/06 Javascript
JavaScript中的document.referrer在各种浏览器测试结果
2014/07/18 Javascript
运用jQuery定时器的原理实现banner图片切换
2014/10/22 Javascript
JS打开新窗口防止被浏览器阻止的方法
2015/01/03 Javascript
jQuery实现手机号码输入提示功能实例
2015/04/30 Javascript
JavaScript合并两个数组并去除重复项的方法
2015/06/13 Javascript
javascript实现状态栏中文字动态显示的方法
2015/10/20 Javascript
Js判断H5上下滑动方向及滑动到顶部和底部判断的示例代码
2017/11/15 Javascript
vue+express 构建后台管理系统的示例代码
2018/07/19 Javascript
AngularJS与后端php的数据交互方法
2018/08/13 Javascript
python实现迭代法求方程组的根过程解析
2019/11/25 Javascript
JavaScript实现网页tab栏效果制作
2020/11/20 Javascript
[00:48]完美“圣”典2016风云人物:xiao8宣传片
2016/11/30 DOTA
python网络编程实例简析
2014/09/26 Python
python中关于for循环的碎碎念
2017/06/30 Python
python+opencv轮廓检测代码解析
2018/01/05 Python
浅谈python下含中文字符串正则表达式的编码问题
2018/12/07 Python
python自制包并用pip免提交到pypi仅安装到本机【推荐】
2019/06/03 Python
Python 实现try重新执行
2019/12/21 Python
关于tf.nn.dynamic_rnn返回值详解
2020/01/20 Python
Python函数默认参数常见问题及解决方案
2020/03/26 Python
python ETL工具 pyetl
2020/06/07 Python
私人会所最新创业计划书范文
2014/03/24 职场文书
出纳担保书范文
2014/04/02 职场文书
电子商务专业求职信范文
2015/03/19 职场文书
2015年财务人员工作总结
2015/04/10 职场文书
公司开业的祝贺语大全(60条)
2019/07/05 职场文书
Java SSH 秘钥连接mysql数据库的方法
2021/06/28 Java/Android