An Analysis of Basic Usage of the Python lxml Module


Posted in Python on December 21, 2019

This article walks through the basic usage of the Python lxml module with examples, shared here for reference. The details are as follows:

1 Installing lxml

Install with: pip install lxml

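2 Using lxml

2.1 Getting started with the lxml module

Import the etree library from lxml (if your editor offers no autocompletion for it, that does not mean it is unusable):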

from lxml import etree

Use etree.HTML to convert a string into an Element object; it accepts both bytes and str input. An Element object has an xpath method, which returns a list of results:

html = etree.HTML(text) 
ret_list = html.xpath("xpath expression")

To convert an Element object back into a string, use etree.tostring(element), which returns bytes.
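As a small sketch of the points above, the snippet below parses the same markup once as str and once as bytes, runs an XPath query, and serializes the tree. The sample markup and variable names are illustrative assumptions, not taken from the original article.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
markup = "<ul><li>one</li><li>two</li></ul>"

# etree.HTML accepts either str or bytes.
from_str = etree.HTML(markup)
from_bytes = etree.HTML(markup.encode("utf-8"))

# For a node-selecting expression like this one, xpath returns a list.
items_str = from_str.xpath("//li/text()")
items_bytes = from_bytes.xpath("//li/text()")
print(items_str)
print(items_bytes)

# etree.tostring returns bytes; decode it to get a str.
serialized = etree.tostring(from_str)
print(type(serialized))
print(serialized.decode())
```

Note that the serialized result includes html and body wrappers that were not in the input; the parser adds them.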

Suppose we have the following HTML string and want to work with it:

<div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # note: the closing </li> tag is missing here
</ul> </div>

from lxml import etree
text = ''' <div> <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''
# Parse the string into an Element object
html = etree.HTML(text)
print(type(html))
# Serialize the parsed tree back to a str to see what lxml built
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)

The output is:

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
        </li></ul> </div> </body></html>

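As you can see, lxml does fill in the missing tag. Keep in mind, though, that lxml is written by people. Because many pages are not well-formed, or because of bugs in lxml, an extraction written against the response for a URL may still fail to find the data. When that happens, use etree.tostring to see what etree actually turned the HTML into, and write the extraction against that converted HTML string.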
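A minimal sketch of that debugging step, assuming a hypothetical fragment: etree.HTML wraps a bare fragment in html and body elements, so an absolute path written against the raw source matches nothing until it is rewritten against the serialized tree.

```python
from lxml import etree

# Hypothetical fragment: the raw source has no <html> or <body>.
fragment = "<div><p>hello</p></div>"
tree = etree.HTML(fragment)

# Path written against the raw source, where <div> looks like the root.
naive = tree.xpath("/div/p/text()")
print(naive)

# Serialize the tree to see what lxml actually built.
converted = etree.tostring(tree).decode()
print(converted)

# Path written against the converted tree.
fixed = tree.xpath("/html/body/div/p/text()")
print(fixed)
```

The first query comes back empty and the second finds the text, which is the kind of mismatch that inspecting the etree.tostring output makes visible.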

2.2 Further practice with lxml

Let's keep going. Suppose each li tag with class item-1 is one news item. How do we turn each news item into a dictionary?

from lxml import etree
text = ''' <div> <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''
html = etree.HTML(text)
# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# Pair the two lists by position and build one dict per item
for href, title in zip(href_list, title_list):
    item = {}
    item["href"] = href
    item["title"] = title
    print(item)

The output is:

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

Now suppose that, in some case, one news item has no href. What happens then?

from lxml import etree
text = ''' <div> <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''
html = etree.HTML(text)
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# Pair the two lists by position, as before
for href, title in zip(href_list, title_list):
    print({"href": href, "title": title})

The result is:

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

The pairings are all wrong, which is not what we want. Section 2.3 shows how to solve this problem.

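2.3 Advanced usage of the lxml module

So far, selecting an attribute or text has returned strings. But what is returned when we select a node?

We get Element objects, which can themselves call the xpath method. This lets us structure later extraction in two steps: first group by some tag, then extract the data within each group.

For example: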

from lxml import etree
text = ''' <div> <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

The result is:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

As you can see, the result is a list of Element objects, and each of them can keep using the xpath method.
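One detail worth sketching when calling xpath on an element: a path starting with ./ is evaluated relative to that element, whereas a path starting with // still searches the whole document. The markup below is a shortened, hypothetical version of the sample above.

```python
from lxml import etree

# Shortened, hypothetical version of the sample markup.
markup = """
<ul>
  <li class="item-1"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""
tree = etree.HTML(markup)
first_li = tree.xpath("//li[@class='item-1']")[0]

# "./" is relative to first_li, so only its own <a> is matched.
relative = first_li.xpath("./a/@href")
# "//" starts from the document root, even when called on first_li.
absolute = first_li.xpath("//a/@href")
print(relative)
print(absolute)

# Elements also expose their tag name and attributes directly.
print(first_li.tag, first_li.get("class"))
```

This is why the grouped extraction uses ./a/... inside the loop: each query stays within its own li.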

First group by the li tag, then extract the data within each group:

from lxml import etree
text = ''' <div> <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''
# Group by the li tag
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
# Extract the data within each group
for li in li_list:
    # Run each query once, then fall back to None when nothing matched
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    item = {}
    item["href"] = hrefs[0] if hrefs else None
    item["title"] = titles[0] if titles else None
    print(item)

The result is:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

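In the code above, each extraction needs a check, because some pages may not contain the data. A conditional (ternary) expression handles that case.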
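If the same check is repeated for many fields, it can be factored into a small helper. This is only a sketch: the function name first_or_none is hypothetical (it is not part of lxml), and the markup is a shortened version of the sample above.

```python
from lxml import etree


def first_or_none(element, path):
    """Return the first XPath result for path, or None if nothing matched."""
    results = element.xpath(path)
    return results[0] if results else None


# Shortened, hypothetical version of the sample markup.
markup = """
<ul>
  <li class="item-1"><a>first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""
tree = etree.HTML(markup)

# Group by li first, then extract each field within its own group.
items = [
    {
        "href": first_or_none(li, "./a/@href"),
        "title": first_or_none(li, "./a/text()"),
    }
    for li in tree.xpath("//li[@class='item-1']")
]
for item in items:
    print(item)
```

The helper assumes the path selects nodes, so that xpath returns a list; expressions that return a number or string would need different handling.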
