python实现将html表格转换成CSV文件的方法


Posted in Python onJune 28, 2015

本文实例讲述了python实现将html表格转换成CSV文件的方法。分享给大家供大家参考。具体如下:

使用方法:python html2csv.py *.html
这段代码使用了 HTMLParser 模块

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Hello, this program is written in Python - http://python.org
programname = 'html2csv - version 2002-09-20 - http://sebsauvage.net'
import sys, getopt, os.path, glob, HTMLParser, re
try:  import psyco ; psyco.jit() # If present, use psyco to accelerate the program
except: pass
def usage(progname):
  ''' Display program usage. '''
  progname = os.path.split(progname)[1]
  if os.path.splitext(progname)[1] in ['.py','.pyc']: progname = 'python '+progname
  return '''%s
A coarse HTML tables to CSV (Comma-Separated Values) converter.
Syntax  : %s source.html
Arguments : source.html is the HTML file you want to convert to CSV.
      By default, the file will be converted to csv with the same
      name and the csv extension (source.html -> source.csv)
      You can use * and ?.
Examples  : %s mypage.html
      : %s *.html
This program is public domain.
Author : Sebastien SAUVAGE <sebsauvage at sebsauvage dot net>
     http://sebsauvage.net
''' % (programname, progname, progname, progname)
class html2csv(HTMLParser.HTMLParser):
  ''' A basic parser which converts HTML tables into CSV.
    Feed HTML with feed(). Get CSV with getCSV(). (See example below.)
    All tables in HTML will be converted to CSV (in the order they occur
    in the HTML file).
    You can process very large HTML files by feeding this class with chunks
    of html while getting chunks of CSV by calling getCSV().
    Should handle badly formated html (missing <tr>, </tr>, </td>,
    extraneous </td>, </tr>...).
    This parser uses HTMLParser from the HTMLParser module,
    not HTMLParser from the htmllib module.
    Example: parser = html2csv()
         parser.feed( open('mypage.html','rb').read() )
         open('mytables.csv','w+b').write( parser.getCSV() )
    This class is public domain.
    Author: Sébastien SAUVAGE <sebsauvage at sebsauvage dot net>
        http://sebsauvage.net
    Versions:
      2002-09-19 : - First version
      2002-09-20 : - now uses HTMLParser.HTMLParser instead of htmllib.HTMLParser.
            - now parses command-line.
    To do:
      - handle <PRE> tags
      - convert html entities (&name; and &#ref;) to Ascii.
      '''
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.CSV = ''   # The CSV data
    self.CSVrow = ''  # The current CSV row beeing constructed from HTML
    self.inTD = 0   # Used to track if we are inside or outside a <TD>...</TD> tag.
    self.inTR = 0   # Used to track if we are inside or outside a <TR>...</TR> tag.
    self.re_multiplespaces = re.compile('\s+') # regular expression used to remove spaces in excess
    self.rowCount = 0 # CSV output line counter.
  def handle_starttag(self, tag, attrs):
    if  tag == 'tr': self.start_tr()
    elif tag == 'td': self.start_td()
  def handle_endtag(self, tag):
    if  tag == 'tr': self.end_tr()
    elif tag == 'td': self.end_td()     
  def start_tr(self):
    if self.inTR: self.end_tr() # <TR> implies </TR>
    self.inTR = 1
  def end_tr(self):
    if self.inTD: self.end_td() # </TR> implies </TD>
    self.inTR = 0      
    if len(self.CSVrow) > 0:
      self.CSV += self.CSVrow[:-1]
      self.CSVrow = ''
    self.CSV += '\n'
    self.rowCount += 1
  def start_td(self):
    if not self.inTR: self.start_tr() # <TD> implies <TR>
    self.CSVrow += '"'
    self.inTD = 1
  def end_td(self):
    if self.inTD:
      self.CSVrow += '",' 
      self.inTD = 0
  def handle_data(self, data):
    if self.inTD:
      self.CSVrow += self.re_multiplespaces.sub(' ',data.replace('\t',' ').replace('\n','').replace('\r','').replace('"','""'))
  def getCSV(self,purge=False):
    ''' Get output CSV.
      If purge is true, getCSV() will return all remaining data,
      even if <td> or <tr> are not properly closed.
      (You would typically call getCSV with purge=True when you do not have
      any more HTML to feed and you suspect dirty HTML (unclosed tags). '''
    if purge and self.inTR: self.end_tr() # This will also end_td and append last CSV row to output CSV.
    dataout = self.CSV[:]
    self.CSV = ''
    return dataout
if __name__ == "__main__":
  try: # Put getopt in place for future usage.
    opts, args = getopt.getopt(sys.argv[1:],None)
  except getopt.GetoptError:
    print usage(sys.argv[0]) # print help information and exit:
    sys.exit(2)
  if len(args) == 0:
    print usage(sys.argv[0]) # print help information and exit:
    sys.exit(2)    
  print programname
  html_files = glob.glob(args[0])
  for htmlfilename in html_files:
    outputfilename = os.path.splitext(htmlfilename)[0]+'.csv'
    parser = html2csv()
    print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
    try:
      htmlfile = open(htmlfilename, 'rb')
      csvfile = open( outputfilename, 'w+b')
      data = htmlfile.read(8192)
      while data:
        parser.feed( data )
        csvfile.write( parser.getCSV() )
        sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
        data = htmlfile.read(8192)
      csvfile.write( parser.getCSV(True) )
      csvfile.close()
      htmlfile.close()
    except:
      print 'Error converting %s    ' % htmlfilename
      try:  htmlfile.close()
      except: pass
      try:  csvfile.close()
      except: pass
  print 'All done. '

希望本文所述对大家的Python程序设计有所帮助。

Python 相关文章推荐
用Python中的wxPython实现最基本的浏览器功能
Apr 14 Python
使用Python神器对付12306变态验证码
Jan 05 Python
批量获取及验证HTTP代理的Python脚本
Apr 23 Python
PyQt5 对图片进行缩放的实例
Jun 18 Python
pytz格式化北京时间多出6分钟问题的解决方法
Jun 21 Python
python实现大文本文件分割
Jul 22 Python
利用python-docx模块写批量生日邀请函
Aug 26 Python
Python 中 -m 的典型用法、原理解析与发展演变
Nov 11 Python
最小二乘法及其python实现详解
Feb 24 Python
Python中if有多个条件处理方法
Feb 26 Python
python/golang 删除链表中的元素
Sep 14 Python
python 爬虫基本使用——统计杭电oj题目正确率并排序
Oct 26 Python
python实现根据主机名字获得所有ip地址的方法
Jun 28 #Python
python自动zip压缩目录的方法
Jun 28 #Python
python查找指定具有相同内容文件的方法
Jun 28 #Python
python中getaddrinfo()基本用法实例分析
Jun 28 #Python
python实现搜索指定目录下文件及文件内搜索指定关键词的方法
Jun 28 #Python
分析用Python脚本关闭文件操作的机制
Jun 28 #Python
python实现linux下使用xcopy的方法
Jun 28 #Python
You might like
ThinkPHP中Session用法详解
2014/11/29 PHP
JQuery AJAX提交中文乱码的解决方案
2010/07/02 Javascript
JavaScript 判断日期格式是否正确的实现代码
2011/07/04 Javascript
文本框根据输入内容自适应高度的代码
2011/10/24 Javascript
JavaScript简单实现鼠标拖动选择功能
2014/03/06 Javascript
js实现文章文字大小字号功能完整实例
2014/11/01 Javascript
JQuery显示隐藏DIV的方法及代码实例
2015/04/16 Javascript
浅谈jQuery框架Ajax常用选项
2017/07/08 jQuery
简单实现js鼠标跟随效果
2020/08/02 Javascript
vue-router实现组件间的跳转(参数传递)
2017/11/07 Javascript
在vue项目创建的后初始化首次使用stylus安装方法分享
2018/01/25 Javascript
vue webpack打包优化操作技巧
2018/02/22 Javascript
vue2.0 computed 计算list循环后累加值的实例
2018/03/07 Javascript
使用Angular CLI进行单元测试和E2E测试的方法
2018/03/24 Javascript
jquery简易手风琴插件的封装
2020/10/13 jQuery
[47:42]完美世界DOTA2联赛PWL S2 GXR vs Ink 第一场 11.19
2020/11/20 DOTA
[48:48]完美世界DOTA2联赛PWL S3 Magama vs GXR 第一场 12.19
2020/12/24 DOTA
Python中使用asyncio 封装文件读写
2016/09/11 Python
Python开发网站目录扫描器的实现
2019/02/21 Python
python3.6下Numpy库下载与安装图文教程
2019/04/02 Python
python-tornado的接口用swagger进行包装的实例
2019/08/29 Python
提升python处理速度原理及方法实例
2019/12/25 Python
完美解决keras保存好的model不能成功加载问题
2020/06/11 Python
python字典按照value排序方法
2020/12/28 Python
HTML5+Canvas+CSS3实现齐天大圣孙悟空腾云驾雾效果
2016/04/26 HTML / CSS
详解如何解决canvas图片getImageData,toDataURL跨域问题
2018/09/17 HTML / CSS
荷兰优雅女装网上商店:Heine
2016/11/14 全球购物
FitFlop美国官网:英国符合人体工学的鞋类品牌
2018/10/05 全球购物
zooplus德国:便宜地订购动物用品、动物饲料、动物食品
2020/05/06 全球购物
2015年消费者权益日活动总结
2015/02/09 职场文书
预备党员考察表党小组意见
2015/06/01 职场文书
《1942》观后感
2015/06/08 职场文书
青年志愿者活动感想
2015/08/07 职场文书
详解MySQL InnoDB存储引擎的内存管理
2021/04/08 MySQL
python基础入门之字典和集合
2021/06/13 Python
使用python绘制横竖条形图
2022/04/21 Python