Example: reading specified elements from HTML with Python and generating an Excel file


Posted in Python on April 03, 2014

A script written in Python 2.7 that reads specified elements from HTML files and generates an Excel file.

#coding=gbk
# Python 2.7 script; needs the third-party packages xlwt and beautifulsoup4 (bs4).
import string
import codecs
import os,time
import xlwt
from bs4 import BeautifulSoup
class LogMsg:
        """Thin wrapper around logging that writes timestamped messages to a file."""
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        # Level takes the standard numeric values: 10=DEBUG, 20=INFO,
                        # 30=WARNING, 40=ERROR, 50=CRITICAL; anything else means NOTSET.
                        if Level in (10,20,30,40,50):
                                self.logger.setLevel(Level)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except Exception:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                # Log at the logger's own effective level so the message is never filtered out.
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level in (10,20,30,40,50):
                                self.logger.log(Level,logInfo)
                        else:
                                self.logger.info(logInfo)
                except Exception:
                        print "log output error!"
                        exit(1)
        def close(self):
                # Detach and close the file handler so the log file is flushed and released.
                try:
                        self.logger.removeHandler(self.hdlr)
                        self.hdlr.close()
                except Exception:
                        print "log closed error!"
                        exit(1)
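# Timestamps: full date and time for the output file name, date only for the daily log file.
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
# The logs directory must already exist; FileHandler does not create it.
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)  # 20 = logging.INFO

# Directory scanned for .html files; the .xls output is written here too.
DATAPATH = '/data/pyExample/'
XLSname = 'dangjian_'+Logtime+'.xls'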

if __name__ == '__main__':

    wbk = xlwt.Workbook(encoding = 'gbk')
    # Sheet name: "basic content import template". The Chinese literals are kept
    # as-is because they are the data written to the spreadsheet.
    sheet = wbk.add_sheet('基本内容导入模板')
    # Header row: content type, category name, category ID, content name, duration,
    # keywords, highlights, author, source, sub-content 1, sub-content 2.
    headers = ['内容类型 ','栏目名称','栏目编号','内容名称','时长','关键字','看点','作者','来源','子内容1','子内容2']
    for col, title in enumerate(headers):
        sheet.write(0,col,title)
    xlsContent = []
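    files = os.listdir(DATAPATH)
    k = 0  # number of data rows written so far
    for f in files:
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('current file: '+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('file is empty')
            for line in lines:
                # strip() already removes the newline, so test for an empty result.
                if not line.strip():
                    log.output('blank line')
                else:
                    # Note: this removes every space, including the one between a tag
                    # name and its attributes, so a <td> with attributes would no
                    # longer match below.
                    line = line.replace(' ','')
                    # Name the parser explicitly instead of letting bs4 pick one.
                    soup  = BeautifulSoup(line, 'html.parser')
                    for tdd in soup.find_all('td'):
                        content.append(tdd.text.encode("gbk"))
            htmlFile.close()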
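            # Dump every cell with its index; enumerate() gives the true position,
            # whereas content.index() returns the first match for duplicate cells.
            for idx, cell in enumerate(content):
                print idx,',',cell
                log.output(cell)
                log.output(idx)
            print '----------------------------------------'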
            
            # The indices below depend on the layout of the source HTML tables.
            folderName =  content[6]
            contentName=  content[4]
            # Keep only the digits of the duration cell, then multiply by 60
            # (presumably minutes to seconds).
            duration =    filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6]
            description = content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print description
            print videoName_1
            log.output('xls row: '+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+description+',管理员,华数编辑,'+videoName_1+',,')
            print k
            # One row per HTML file; '管理员' (administrator) and '华数编辑' (Wasu editor)
            # are fixed values for the author and source columns.
            row = ['',folderName,'',contentName,str_duration,keyWord,description,'管理员','华数编辑',videoName_1,'']
            for col, value in enumerate(row):
                sheet.write(k+1,col,value)
            k+=1
    wbk.save(DATAPATH + XLSname)
    log.close()
    print '=========================================' 
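
For readers on Python 3, the two most fragile steps of the script above (collecting `<td>` text and turning the duration cell into seconds) can be sketched with the standard library alone. This is a minimal sketch rather than a port of the script: it swaps BeautifulSoup for `html.parser`, parses the whole document at once instead of line by line, and the names `TdCollector`, `extract_cells` and `minutes_to_seconds`, as well as the sample HTML, are made up for illustration. It also assumes the factor of 60 means the cell holds minutes.

```python
# Python 3 sketch; the helper names are illustrative, not part of the original script.
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collects the text of every top-level <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts
        self._depth = 0    # > 0 while inside at least one <td>
        self._parts = []   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_cells(html_text):
    """Return the stripped text of each <td> found in html_text."""
    parser = TdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.cells


def minutes_to_seconds(cell_text):
    """Keep only the decimal digits of cell_text and multiply by 60."""
    digits = "".join(ch for ch in cell_text if ch.isdecimal())
    if not digits:
        raise ValueError("no digits in duration cell: %r" % cell_text)
    return int(digits) * 60


# Made-up sample input; attributes on <td> are handled without stripping spaces.
sample = "<table><tr><td class='k'>Duration</td><td>45 min</td></tr></table>"
cells = extract_cells(sample)
print(cells)
print(minutes_to_seconds(cells[1]))
```

Feeding the whole file to the parser in one call means a cell that spans several lines is still read as one cell; the text of nested cells is folded into the enclosing one. An empty duration cell raises `ValueError` here instead of failing inside the integer conversion.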