Example: using Python to parse XML into the corresponding HTML


Posted in Python on April 02, 2014

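This example uses SAX to parse ddt.xml into HTML. Of course, if you have the XSL file that goes with the XML, you can convert it to HTML directly with libxml2 (through its companion XSLT library, libxslt).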
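
As a minimal sketch of how SAX event handling works (not part of the original program), the snippet below parses a small in-memory document and collects the text of every title element. The names TitleCollector and SAMPLE are hypothetical and exist only for this illustration; the code is written to run on Python 3 and should also work on Python 2.7.

```python
# Minimal SAX sketch; TitleCollector and SAMPLE are illustrative names only.
from xml.sax import parseString
from xml.sax.handler import ContentHandler

SAMPLE = b"<rss><channel><title>News</title><item><title>First</title></item></channel></rss>"


class TitleCollector(ContentHandler):
    """Collect the text of every <title> element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self.buffer))
            self.in_title = False


handler = TitleCollector()
parseString(SAMPLE, handler)
print(handler.titles)
```

The parser never builds a tree: it only calls startElement, characters, and endElement in document order, so any state (such as "am I inside a title?") has to be tracked by the handler itself.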

#!/usr/bin/env python 
# -*- coding: utf-8 -*-
#---------------------------------------
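#   Program:  XML parser
#   Version:  01.0
#   Author:   mupeng
#   Date:     2013-12-18
#   Language: Python 2.7
#   Purpose:  parse XML into the corresponding HTML
#   Notes:    uses the parse function of the xml.sax module to parse the XML and generate events;
#             subclasses ContentHandler and overrides its event handlers;
#             Dispatcher routes the start and end events of each tag to the matching handler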
#---------------------------------------
from xml.sax.handler import ContentHandler
from xml.sax import parse
class Dispatcher:
    def dispatch(self, prefix, name, attrs=None):
        # e.g. prefix='start', name='title' -> self.startTitle
        mname = prefix + name.capitalize()
        # fallback handler: defaultStart or defaultEnd
        dname = 'default' + prefix.capitalize()
        method = getattr(self, mname, None)
        if not callable(method):
            method = getattr(self, dname, None)
        # Handlers take no arguments; attrs is accepted but not forwarded.
        if callable(method): method()
    def startElement(self, name, attrs):
        self.dispatch('start', name, attrs)
    def endElement(self, name):
        self.dispatch('end', name)
class Website(Dispatcher, ContentHandler):
    def __init__(self):
        self.fout = open('ddt_SAX.html', 'w')
        self.imagein = False
        self.desflag = False
        self.item = False
        self.title = ''
        self.link = ''
        self.guid = ''
        self.url = ''
        self.pubdate = ''
        self.description = ''
        self.temp = ''
        self.prx = ''
    def startChannel(self):
        self.fout.write('''<html>\n<head>\n<title> RSS-''')
    def endChannel(self):
       self.fout.write('''
                    <tr><td height="20"></td></tr>
                    </table>
                    </center>
                    <script>
    function  GetTimeDiff(str)
    {
     if(str == '')
     {
      return '';
     }
     var pubDate = new Date(str);
     var nowDate = new Date();
     var diffMilSeconds = nowDate.valueOf()-pubDate.valueOf();
     var days = diffMilSeconds/86400000;
     days = parseInt(days);
     diffMilSeconds = diffMilSeconds-(days*86400000);
     var hours = diffMilSeconds/3600000;
     hours = parseInt(hours);
     diffMilSeconds = diffMilSeconds-(hours*3600000);
     var minutes = diffMilSeconds/60000;
     minutes = parseInt(minutes);
     diffMilSeconds = diffMilSeconds-(minutes*60000);
     var seconds = diffMilSeconds/1000;
     seconds = parseInt(seconds);
     var returnStr = "Published (Beijing time): " + pubDate.toLocaleString();
     if(days > 0)
     {
      returnStr = returnStr + " (" + days + " days " + hours + " hours " + minutes + " minutes ago)";
     }
     else if (hours > 0)
     {
      returnStr = returnStr + " (" + hours + " hours " + minutes + " minutes ago)";
     }
     else if (minutes > 0)
     {
      returnStr = returnStr + " (" + minutes + " minutes ago)";
     }
     return returnStr;
    }
    function GetSpanText()
    {
     var pubDate;
     var pubDateArray;
     var spanArray = document.getElementsByTagName("span");
     for(var i = 0; i < spanArray.length; i++)
     {
      pubDate = spanArray[i].innerHTML;
      document.getElementsByTagName("span")[i].innerHTML = GetTimeDiff(pubDate);   
     }
    }
    GetSpanText();
   </script>
                </body>
                </html>
                ''')
       self.fout.close()
    def characters(self, chars):
        # Accumulate non-blank text; SAX may deliver one text node in several chunks.
        if chars.strip():
            self.temp += chars
    def startTitle(self):
        if self.item:
            self.fout.write('''
                        <tr bgcolor="#eeeeee">\n<td style="padding-top:5px;padding-left:5px;" height="30">\n<B>
                    ''')
    def endTitle(self):
        if not self.imagein and not self.item:
            self.title = self.temp
            self.temp = ''
            self.fout.write(self.title.encode('gb2312'))
            #self.title = self.temp
            self.fout.write('''
                </title>\n</head>\n<body>\n<center>\n
                <script>\n
                        function copyLink()
                        {
                                clipboardData.setData("Text",window.location.href);
                                alert("RSS link copied to the clipboard");
                        }
                        function subscibeLink()
                        {
                                var str = window.location.pathname;
                                while(str.match(/^\//))
                                {
                                        str = str.replace(/^\//,"");
                                }
                                window.open("http://rss.sina.com.cn/my_sina_web_rss_news.html?url=" + str,"_self");
                        }
                        </script>\n
                <table width="750" cellpadding="0" cellspacing="0">\n
                <tr>\n
                <td align="right" style="padding-right:15px;" valign="bottom">\n
            ''')
        if self.item:
            self.title = self.temp
            self.temp = ''
            self.fout.write(self.title.encode('gb2312'))
            self.fout.write('''
                        </B>
                        </td>
                        </tr>
                        <tr bgcolor="#eeeeee">
                        <td style="padding-left:5px;">
                        ''')
    def startImage(self):
        self.imagein = True
    def endImage(self):
        self.imagein = False
    def startLink(self):
        if self.imagein:
            self.fout.write('''<A href=" ''')
            
    def endLink(self):
        self.link = self.temp
        self.temp = ''
        if self.imagein:
            self.fout.write(self.link.encode('gb2312'))
            self.fout.write('''" target="_blank">\n ''')
        elif self.item:
            #self.link = self.temp
            pass
        else:
            self.fout.write(self.link)
            self.fout.write(''' " target="
      _blank
     "> ''')
            self.fout.write(self.title.encode('gb2312'))
            self.fout.write(''' </A></B></td>
                            </tr>
                            <tr><td colspan="2" align="center">
                            ''')
            self.fout.write(self.description.encode('gb2312'))
            self.fout.write('''
                        </td></tr>
                        <tr style="font-size:12px;" bgcolor="#eeeeff"><td colspan="2" style="font-size:14px;padding-top:5px;padding-bottom:5px;"><b><a href="javascript:copyLink();">Copy this page link</a>                <a href="javascript:subscibeLink();">Embed this news list in my page (simple, fast, real-time, free)</a></b></td></tr>
                        </table>
                        <table width="750" cellpadding="0" cellspacing="0">
                            ''')
    def startUrl(self):
        if self.imagein:
            self.fout.write('''<IMG src=" ''')
    def endUrl(self):
        self.url = self.temp
        self.temp = ''
        if self.imagein:
            self.fout.write(self.url.encode('gb2312'))
            self.fout.write('''" border="0">\n
                            </A>
                            </td>
                            <td align="left" valign="bottom" style="padding-bottom:8px;"><B><A href="
                            ''')
        if self.item:
            #self.url = self.temp
            pass
    def defaultStart(self):
        pass
    def defaultEnd(self):
        self.temp = ''
    def startDescription(self):
        pass
    def endDescription(self):
        self.description = self.temp
        self.temp = ''
        if self.item:
            #self.fout.write(u'\u3000\u3000'.encode('gb2312'))  # optional: indent with two full-width spaces
            self.fout.write(self.description.encode('gb2312'))
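    def endGuid(self):
        self.guid = self.temp
        self.temp = ''  # reset the buffer so the guid text does not leak into the next element
    def endPubdate(self):
        # Ignore a buffer that still holds a URL (stale link/guid text) rather than a date.
        if not self.temp.startswith('http'):
            self.pubdate = self.temp
        else:
            self.pubdate = ''
        self.temp = ''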
    def startItem(self):
        self.item = True
    def endItem(self):
        self.item = False
        self.fout.write('''
                            </td>
                            </tr>
                            <tr bgcolor="#eeeeee">
                            <td style="padding-top:5px;padding-left:5px;">
                            <A href="''')
        self.fout.write(self.link)
        self.fout.write(''' " target="_blank"> ''')
        self.fout.write(self.guid)
        self.fout.write('''
                        </A>
                        </td>
                        </tr>
                        <tr bgcolor="#eeeeee">
                        <td style="padding-top:5px;padding-left:5px;padding-bottom:5px;"><span>''')
        self.fout.write(self.pubdate)
        self.fout.write('''</span></td>
                        </tr>
                        <tr height="10"><td></td></tr>''')
# Program entry point
if __name__ == '__main__':
    parse('ddt.xml', Website())
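
The name-based dispatch used by the Dispatcher class can be tried in isolation. The following is only a sketch under assumptions: RecordingDispatcher and its calls list are hypothetical names introduced for illustration, and the handlers simply record which method ran instead of writing HTML.

```python
# Sketch of name-based dispatch; RecordingDispatcher is an illustrative name only.
class RecordingDispatcher(object):
    def __init__(self):
        self.calls = []

    def dispatch(self, prefix, name):
        # 'start' + 'pubDate'.capitalize() -> 'startPubdate'
        method = getattr(self, prefix + name.capitalize(), None)
        if not callable(method):
            # Fall back to defaultStart / defaultEnd when no specific handler exists.
            method = getattr(self, 'default' + prefix.capitalize(), None)
        if callable(method):
            method()

    def startPubdate(self):
        self.calls.append('startPubdate')

    def defaultStart(self):
        self.calls.append('defaultStart')


d = RecordingDispatcher()
d.dispatch('start', 'pubDate')   # handled by startPubdate
d.dispatch('start', 'category')  # no startCategory, so defaultStart runs
d.dispatch('end', 'category')    # no endCategory and no defaultEnd: ignored
print(d.calls)
```

Note that str.capitalize() lowercases everything after the first character, which is why the handler for the pubDate tag has to be spelled endPubdate rather than endPubDate.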