Python lxml模块的基本使用方法分析


Posted in Python onDecember 21, 2019

本文实例讲述了Python lxml模块的基本使用方法。分享给大家供大家参考,具体如下:

1 lxml的安装

安装方式:pip install lxml

2 lxml的使用

2.1 lxml模块的入门使用

导入lxml 的 etree 库 (导入没有提示不代表不能用)

from lxml import etree

利用etree.HTML,将字符串转化为Element对象,Element对象具有xpath的方法,返回结果的列表,能够接受bytes类型的数据和str类型的数据

html = etree.HTML(text) 
ret_list = html.xpath("xpath字符串")

把转化后的element对象转化为字符串,返回bytes类型结果 etree.tostring(element)

假设我们现有如下的html字符换,尝试对他进行操作

<div> <ul> 
<li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
<li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
<li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
<li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
<li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> # 注意,此处缺少一个 </li> 闭合标签 
</ul> </div>
from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
print(type(html)) 
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

输出为

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>
        <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>
        <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>
        <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>
        <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
        </li></ul> </div> </body></html>

可以发现,lxml确实能够把确实的标签补充完成,但是请注意lxml是人写的,很多时候由于网页不够规范,或者是lxml的bug,即使参考url地址对应的响应去提取数据,任然获取不到,这个时候我们需要使用etree.tostring的方法,观察etree到底把html转化成了什么样子,即根据转化后的html字符串去进行数据的提取。

2.2 lxml的深入练习

接下来我们继续操作,假设每个class为item-1的li标签是1条新闻数据,如何把这条新闻数据组成一个字典

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
#获取href的列表和title的列表
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
#组装成字典
for href in href_list:
  item = {}
  item["href"] = href
  item["title"] = title_list[href_list.index(href)]
  print(item)

输出为

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

假设在某种情况下,某个新闻的href没有,那么会怎样呢?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

结果是

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

数据的对应全部错了,这不是我们想要的,接下来通过2.3小节的学习来解决这个问题

2.3 lxml模块的进阶使用

前面我们取到属性,或者是文本的时候,返回字符串 但是如果我们取到的是一个节点,返回什么呢?

返回的是element对象,可以继续使用xpath方法,对此我们可以在后面的数据提取过程中:先根据某个标签进行分组,分组之后再进行数据的提取

示例如下:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

结果为:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

可以发现结果是一个element对象,这个对象能够继续使用xpath方法

先根据li标签进行分组,之后再进行数据的提取

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
#根据li标签进行分组
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
#在每一组中继续进行数据的提取
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)

结果是:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

前面的代码中,进行数据提取需要判断,可能某些一面不存在数据的情况,对应的可以使用三元运算符来解决

Python 相关文章推荐
Python中__call__用法实例
Aug 29 Python
Python的ORM框架中SQLAlchemy库的查询操作的教程
Apr 25 Python
Python爬虫利用cookie实现模拟登陆实例详解
Jan 12 Python
pip安装时ReadTimeoutError的解决方法
Jun 12 Python
python将处理好的图像保存到指定目录下的方法
Jan 10 Python
Python实现简单层次聚类算法以及可视化
Mar 18 Python
75条笑死人的知乎神回复,用60行代码就爬完了
May 06 Python
python字典改变value值方法总结
Jun 21 Python
python实现最大优先队列
Aug 29 Python
如何基于Python批量下载音乐
Nov 11 Python
Anaconda详细安装步骤图文教程
Nov 12 Python
Scrapy实现模拟登录的示例代码
Feb 21 Python
python Manager 之dict KeyError问题的解决
Dec 21 #Python
tornado+celery的简单使用详解
Dec 21 #Python
Python selenium的基本使用方法分析
Dec 21 #Python
Flask框架搭建虚拟环境的步骤分析
Dec 21 #Python
Django restframework 框架认证、权限、限流用法示例
Dec 21 #Python
python支持多线程的爬虫实例
Dec 21 #Python
Python 实现try重新执行
Dec 21 #Python
You might like
php将数据库中所有内容生成静态html文档的代码
2010/04/12 PHP
php fputcsv命令 写csv文件遇到的小问题(多维数组连接符)
2011/05/24 PHP
ThinkPHP标签制作教程
2014/07/10 PHP
php通过strpos查找字符串出现位置的方法
2015/03/17 PHP
PHP判断文件是否被引入的方法get_included_files用法示例
2016/11/29 PHP
AutoSave/自动存储功能实现
2007/03/24 Javascript
比较搞笑的js陷阱题
2010/02/07 Javascript
JQuery里面的几种选择器 查找满足条件的元素$(&quot;#控件ID&quot;)
2011/08/23 Javascript
JQuery中SetTimeOut传参问题探讨
2013/05/10 Javascript
js实现单行文本向上滚动效果实例代码
2013/11/28 Javascript
了不起的node.js读书笔记之例程分析
2014/12/22 Javascript
JavaScript数组前面插入元素的方法
2015/04/06 Javascript
理解Javascript的动态语言特性
2015/06/17 Javascript
jQuery实现的fixedMenu下拉菜单效果代码
2015/08/24 Javascript
JQuery ztree带筛选、异步加载实例讲解
2016/02/25 Javascript
JS实现左右无缝轮播图代码
2016/05/01 Javascript
JS简单判断字符在另一个字符串中出现次数的2种常用方法
2017/04/20 Javascript
Bootstrap table使用方法汇总
2017/11/17 Javascript
详解Nodejs内存治理
2018/05/13 NodeJs
Vue多系统切换实现方案
2018/06/05 Javascript
微信小程序上传多图到服务器并获取返回的路径
2019/05/05 Javascript
node-red File读取好保存实例讲解
2019/09/11 Javascript
六个窍门助你提高Python运行效率
2015/06/09 Python
深入了解Python数据类型之列表
2016/06/24 Python
一道python走迷宫算法题
2018/01/22 Python
使用python中的in ,not in来检查元素是不是在列表中的方法
2018/07/06 Python
Python基础学习之类与实例基本用法与注意事项详解
2019/06/17 Python
Python 如何实现访问者模式
2020/07/28 Python
CSS3实现的闪烁跳跃进度条示例(附源码)
2013/08/19 HTML / CSS
荷兰之家英文站:Holland at Home
2016/10/26 全球购物
ALDO加拿大官网:加拿大女鞋品牌
2018/12/22 全球购物
大学生优秀的自我评价分享
2013/10/22 职场文书
宠物店的创业计划书范文
2014/01/11 职场文书
合伙开公司协议书范本
2014/10/28 职场文书
生死牛玉儒观后感
2015/06/11 职场文书
《惊弓之鸟》教学反思
2016/02/20 职场文书