
Ways to Parse XML in Python

Posted in Python on November 16, 2019

This article walks through several ways to parse XML in Python, with complete example code for each. It should be a useful reference for study or day-to-day work.

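Python offers three ways to parse XML. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are a good fit when you need to work with the DOM API. The second is the xml.sax.* modules, an implementation of the SAX API, which trades convenience for speed and a lower memory footprint. SAX is an event-based API, so it can process very large documents "on the fly" without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight, Pythonic API. ET is much faster than DOM and has a pleasant API to work with. Compared with SAX, ET.iterparse also offers "on the fly" processing, so the whole document need not be loaded into memory. On average, ET performs about as well as SAX, while its API is higher-level and more convenient to use.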
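To make the "on the fly" idea concrete, here is a minimal sketch of streaming with ET.iterparse. The inline XML_DATA string and the helper name iter_movie_titles are assumptions made for illustration; in practice the source would be a large file on disk.

```python
import io
import xml.etree.ElementTree as ET

# A small inline document standing in for a large file (assumption for this sketch)
XML_DATA = b"""<collection shelf="New Arrivals">
  <movie title="Enemy Behind"><year>2003</year></movie>
  <movie title="Transformers"><year>1989</year></movie>
  <movie title="Ishtar"></movie>
</collection>"""

def iter_movie_titles(source):
    """Yield the title of each <movie> as soon as its end tag has been parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "movie":
            yield elem.get("title")
            # Drop the element's children, text and attributes once it has been
            # handled, so already-processed movies do not pile up in memory
            elem.clear()

titles = list(iter_movie_titles(io.BytesIO(XML_DATA)))
print(titles)
```

Because the function is a generator, a caller can stop early without parsing the rest of the document.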

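1. DOM (Document Object Model)

When a DOM parser parses an XML document, it reads the whole document in one pass and stores every element in an in-memory tree. You can then use the functions DOM provides to read or modify the document's content and structure, and write the modified content back to an XML file.

In Python, xml.dom.minidom is used to parse XML files this way.

The examples in this article use the following sample file, movie.xml: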
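The read, modify and write-back cycle can be sketched roughly as follows. This is only an illustration: it parses an inline string rather than a file, and the new star rating and the year value are made-up sample data.

```python
from xml.dom.minidom import parseString

# Inline sample document (assumption; a real program would parse a file)
XML_DATA = ('<collection shelf="New Arrivals">'
            '<movie title="Ishtar"><stars>2</stars></movie>'
            '</collection>')

dom = parseString(XML_DATA)
root = dom.documentElement

# Read: the whole tree is already in memory
movie = root.getElementsByTagName("movie")[0]
stars = movie.getElementsByTagName("stars")[0]

# Modify content: replace the text inside <stars>
stars.firstChild.data = "3"

# Modify structure: append a new <year> element (hypothetical value)
year = dom.createElement("year")
year.appendChild(dom.createTextNode("1987"))
movie.appendChild(year)

# Serialize the modified tree; writing this string to a file persists the change
updated = root.toxml()
print(updated)
```

Writing `updated` to disk with an ordinary `open(..., "w")` call would complete the round trip.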

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
  <type>War, Thriller</type>
  <format>DVD</format>
  <year>2003</year>
  <rating>PG</rating>
  <stars>10</stars>
  <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
  <type>Anime, Science Fiction</type>
  <format>DVD</format>
  <year>1989</year>
  <rating>R</rating>
  <stars>8</stars>
  <description>A schientific fiction</description>
</movie>
  <movie title="Trigun">
  <type>Anime, Action</type>
  <format>DVD</format>
  <episodes>4</episodes>
  <rating>PG</rating>
  <stars>10</stars>
  <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
  <type>Comedy</type>
  <format>VHS</format>
  <rating>PG</rating>
  <stars>2</stars>
  <description>Viewable boredom</description>
</movie>
</collection>

The Python implementation is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movie.xml")
# Get the root element object
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
  print("Root element : %s" % collection.getAttribute("shelf"))
  # Get the tag name
  #print(collection.nodeName)

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
  print("*****Movie*****")
  if movie.hasAttribute("title"):
    print("Title: %s" % movie.getAttribute("title"))

  # Named movie_type / movie_format to avoid shadowing the built-ins type and format
  movie_type = movie.getElementsByTagName('type')[0]
  print("Type: %s" % movie_type.childNodes[0].data)
  movie_format = movie.getElementsByTagName('format')[0]
  print("Format: %s" % movie_format.childNodes[0].data)
  # <year> is optional, so check that it exists before reading it
  year = movie.getElementsByTagName("year")
  if len(year) > 0:
    print("Year: %s" % year[0].firstChild.data)
    # Parent node: parentNode
    #print(year[0].parentNode.nodeName)
  rating = movie.getElementsByTagName('rating')[0]
  print("Rating: %s" % rating.childNodes[0].data)
  description = movie.getElementsByTagName('description')[0]
  # Show the data between the opening and closing tags
  print("Description: %s" % description.childNodes[0].data)
  #print("Description: %s" % description.firstChild.data)

Output:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

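2. ElementTree

ElementTree is like a lightweight DOM with a convenient, friendly API. Code written with it is easy to work with, and it is fast and uses little memory.

When parsing XML in Python, you would choose either ElementTree or cElementTree. What is the difference between the two?

1. cElementTree is faster than ElementTree because it is written in C.

2. When debugging, you cannot see the parsed field contents with cElementTree, so it is a poor fit for debugging; with ElementTree you can inspect the parsed content and read values while debugging.

3. When iterating over a given tag with iter, cElementTree may not be usable because, depending on the version, it may lack this function, while ElementTree has it. There may be other differences in the available functions as well.

So in everyday use we usually import the module as follows, because cElementTree is faster, and switch to ElementTree when debugging. If you need a particular function, use whichever module provides it.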

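try:
  import xml.etree.cElementTree as ET
except ImportError:
  import xml.etree.ElementTree as ET

Starting with Python 3.3, the ElementTree module automatically looks for an available C implementation to speed things up, so a plain import is enough. (xml.etree.cElementTree was deprecated in Python 3.3 and removed in Python 3.9.)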

import xml.etree.ElementTree as ET
import sys
import os.path

def traverseXml(element):
  # Print every child <movie> of the given element along with its sub-elements
  #print (len(element))
  if len(element) > 0:
    for child in element:
      print("********Movie********")
      print("Title:", child.get("title"))
      for childchild in child:
        print(childchild.tag, ":", childchild.text)
      #traverseXml(child)
  #else:
  #  print (element.tag, "----", element.attrib)

def readXml(xmlFile):
  try:
    tree = ET.parse(xmlFile)
    #print("tree type:", type(tree))
    # Get the root node
    root = tree.getroot()
  except Exception as e: # Catches everything except exit-related exceptions such as SystemExit
    print("Failed to parse %s: %s" % (xmlFile, e))
    sys.exit(1)
  #print("root type:", type(root))
  # root.attrib holds the root's attributes, root.tag its tag name
  #print(root.tag, ":", root.attrib)
  return root

if __name__ == "__main__":
  xmlFilePath = os.path.abspath("movie.xml")
  root = readXml(xmlFilePath)

  # # Access by index
  # print(root[0][0].text)
  # print(root[1][2].text)
  # Find all direct children of root with the given tag name
  # movies=root.findall("movie")
  # Iterate over the child tags
  # print(len(movies))
  # for movie in movies:
  #   type=movie.find("type")
  #   print(type.text)

  # Traverse the XML file
  traverseXml(root)
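The lookup functions mentioned in this section (findall, find and iter) can be compared side by side in a short sketch. It uses an inline XML_DATA string, which is an assumption made so that the snippet runs without movie.xml.

```python
import xml.etree.ElementTree as ET

# Inline sample (assumption) with one movie that has no <year>
XML_DATA = """<collection shelf="New Arrivals">
  <movie title="Enemy Behind"><type>War, Thriller</type><year>2003</year></movie>
  <movie title="Trigun"><type>Anime, Action</type></movie>
</collection>"""

root = ET.fromstring(XML_DATA)

# findall() returns only the direct children that match the tag
titles = [movie.get("title") for movie in root.findall("movie")]

# iter() walks the whole subtree, matching the tag at any depth
types = [node.text for node in root.iter("type")]

# findtext() returns the text of the first match, or a default when it is missing
years = {
    movie.get("title"): movie.findtext("year", default="unknown")
    for movie in root.findall("movie")
}

print(titles)
print(types)
print(years)
```

Using findtext with a default avoids the None check that find would otherwise require for optional elements such as year.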

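3. SAX (Simple API for XML)

The Python standard library includes a SAX parser. SAX uses an event-driven model: as it parses the XML, it fires events one by one and calls user-defined callback functions to process the document.

Parsing an XML document with SAX involves two parts: the parser and the event handler.

The parser reads the XML document and sends events, such as element-start and element-end events, to the event handler.

The event handler responds to those events and processes the XML data passed to it.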

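SAX is well suited to the following situations:

1. Processing large files.
2. Needing only part of a file's content, or only specific information from it.
3. Building your own object model.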
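As a rough illustration of the second case, the sketch below pulls only the movie titles out of a document and ignores everything else. The class name TitleCollector and the inline XML_DATA are assumptions for this example.

```python
import xml.sax

# Inline sample (assumption); with a real file, xml.sax.parse(path, handler) would be used
XML_DATA = b"""<collection shelf="New Arrivals">
  <movie title="Enemy Behind"><format>DVD</format></movie>
  <movie title="Ishtar"><format>VHS</format></movie>
</collection>"""

class TitleCollector(xml.sax.ContentHandler):
    """Keep only the title attribute of each <movie>; all other events are ignored."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "movie":
            self.titles.append(attrs.get("title"))

handler = TitleCollector()
xml.sax.parseString(XML_DATA, handler)
print(handler.titles)
```

The handler never builds a tree, so memory use depends on what it chooses to keep rather than on the size of the document.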

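To process XML with SAX in Python, first import the parse function from xml.sax and ContentHandler from xml.sax.handler.

ContentHandler methods

characters(content) method

When it is called:

From the start of a line up to the next tag, if there are characters, content holds those characters.

From one tag up to the next tag, if there are characters, content holds those characters.

From a tag up to the end of the line, if there are characters, content holds those characters.

The tag can be either a start tag or an end tag.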
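The rules above mean that characters() also receives the whitespace and line breaks between tags, not just the text you care about. A small sketch makes this visible; TextRecorder is a hypothetical helper, and how the parser splits the text into chunks is not guaranteed, which is why the handler appends instead of assigning.

```python
import xml.sax

# Inline sample (assumption) with line breaks and indentation between tags
XML_DATA = b"<movie>\n  <type>War, Thriller</type>\n</movie>"

class TextRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []     # every string passed to characters(), in order
        self.current = None  # element whose start tag was seen most recently
        self.text = {}       # element name -> accumulated text

    def startElement(self, name, attrs):
        self.current = name
        self.text[name] = ""

    def endElement(self, name):
        self.current = None

    def characters(self, content):
        self.chunks.append(content)
        if self.current is not None:
            # Append: one run of text may arrive in several calls
            self.text[self.current] += content

recorder = TextRecorder()
xml.sax.parseString(XML_DATA, recorder)
print(recorder.chunks)
print(recorder.text["type"])
```

Printing `recorder.chunks` shows the whitespace-only strings alongside the element text.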

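startDocument() method

Called when the document starts.

endDocument() method

Called when the parser reaches the end of the document.

startElement(name, attrs) method

Called when an XML start tag is encountered. name is the tag name, and attrs is a dictionary-like object holding the tag's attributes.

endElement(name) method

Called when an XML end tag is encountered.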
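To see the order in which these callbacks fire, a handler can simply record each event. The following is a minimal sketch; EventLogger and the one-line sample document are assumptions for illustration.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each document and element callback in the order the parser invokes it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping; copy it into a plain dict
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

logger = EventLogger()
xml.sax.parseString(b'<movie title="Ishtar"><format>VHS</format></movie>', logger)
for event in logger.events:
    print(event)
```

The start and end events nest the same way the tags do, which is what lets a handler track its position in the document with a simple variable or stack.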

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.sax
import xml.sax.handler

# Child elements of <movie> whose text is collected
FIELDS = ("type", "format", "year", "rating", "stars", "description")

class MovieHandler(xml.sax.ContentHandler):
  def __init__(self):
    super().__init__()
    self.CurrentData = ""
    self.type = ""
    self.format = ""
    self.year = ""
    self.rating = ""
    self.stars = ""
    self.description = ""

  # Handle element-start events
  def startElement(self, tag, attributes):
    self.CurrentData = tag
    if tag == "movie":
      print("*****Movie*****")
      title = attributes["title"]
      print("Title:", title)
    elif tag in FIELDS:
      # Clear the value left over from the previous movie
      setattr(self, tag, "")

  # Handle element-end events
  def endElement(self, tag):
    if self.CurrentData == "type":
      print("Type:", self.type)
    elif self.CurrentData == "format":
      print("Format:", self.format)
    elif self.CurrentData == "year":
      print("Year:", self.year)
    elif self.CurrentData == "rating":
      print("Rating:", self.rating)
    elif self.CurrentData == "stars":
      print("Stars:", self.stars)
    elif self.CurrentData == "description":
      print("Description:", self.description)
    self.CurrentData = ""

  # Handle character data; one run of text may arrive in several calls, so append
  def characters(self, content):
    if self.CurrentData == "type":
      self.type += content
    elif self.CurrentData == "format":
      self.format += content
    elif self.CurrentData == "year":
      self.year += content
    elif self.CurrentData == "rating":
      self.rating += content
    elif self.CurrentData == "stars":
      self.stars += content
    elif self.CurrentData == "description":
      self.description += content

if __name__ == "__main__":
  # Create an XMLReader
  parser = xml.sax.make_parser()
  # Turn off namespaces
  parser.setFeature(xml.sax.handler.feature_namespaces, 0)

  # Install our ContentHandler subclass
  Handler = MovieHandler()
  parser.setContentHandler(Handler)

  parser.parse("movie.xml")


- Author -

子不语332
