Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

A Python 2.7 script that reads specified elements from HTML files and generates an Excel file.

#coding=gbk
# Python 2.7 script; this source file must itself be saved in GBK encoding.
import string
import codecs
import os,time
import xlwt
from bs4 import BeautifulSoup
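# Thin wrapper around the logging module: writes messages to logfile at the given level
# (10=DEBUG, 20=INFO, 30=WARNING, 40=ERROR, 50=CRITICAL; any other value means NOTSET).
class LogMsg: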
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

# Directory that is scanned for .html files; the generated .xls is saved here as well.
DATAPATH = '/data/pyExample/'
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
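    # Create the workbook and write the header row of the import template.
    # Columns: content type, column (category) name, column number, content name,
    # duration, keywords, highlights, author, source, sub-content 1, sub-content 2.
    wbk = xlwt.Workbook(encoding = 'gbk')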
    sheet = wbk.add_sheet('基本内容导入模板')
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('current file: '+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                # strip() also removes the trailing newline, so a blank line equals ''
                if line.strip()=='':
                    log.output('blank line')
                else:
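                    # Note: this removes every space, including those inside tags
                    # such as <td class="...">; adjust if the pages use tag attributes.
                    line = line.replace(' ','')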
                    # Name the parser explicitly so the result does not depend on what is installed.
                    soup  = BeautifulSoup(line, 'html.parser')
                    for tdd in soup.findAll('td'):  
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))       
                #print line.encode('gbk') 
            htmlFile.close()    
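            # Dump every cell with its real position; content.index(i) would
            # report the first match only when two cells hold the same text.
            for idx, i in enumerate(content):
                print idx,',',i
                log.output(i)
                log.output(idx)
            print '----------------------------------------'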
            
            # The cell positions used below are specific to the td layout of the source pages.
            folderName =  content[6]
            contentName=  content[4]       
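            # Keep only the digits of the duration cell, then multiply by 60
            # (presumably converting minutes to seconds).
            duration =    filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = "%i"%int_duration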
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('xls row data:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')    # author: "administrator"
            sheet.write(k+1,8,'华数编辑')  # source: "Wasu editor"
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
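
The script above hands each line to BeautifulSoup separately, so a td cell whose content spans several lines may be cut off. As a rough sketch of the same cell-collection step applied to a whole document at once, the following uses the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, written for Python 3 rather than the Python 2.7 of the script). The names TdTextCollector and extract_td_text are hypothetical and do not appear in the original script.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collect the text of every <td> cell in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell texts
        self._parts = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; start a fresh buffer for each cell.
        if tag == "td":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while a cell is open.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts).strip())
            self._parts = None


def extract_td_text(html):
    """Return the stripped text of each <td> in the given HTML string."""
    collector = TdTextCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


# Sample markup made up for illustration; the second cell spans two lines.
sample = "<table><tr><td>Title</td>\n<td>45\nminutes</td></tr></table>"
print(extract_td_text(sample))
```

This sketch does not handle nested tables (an inner td restarts the buffer), which BeautifulSoup would handle more gracefully.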
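
The duration handling in the script (keep the digits of a cell, then multiply by 60) can be isolated into a small function. This is a minimal sketch in Python 3; the function name duration_to_seconds and the sample strings are hypothetical, and treating the number as minutes is an assumption based on the factor of 60.

```python
def duration_to_seconds(cell_text):
    """Keep only the ASCII digits of cell_text and multiply by 60.

    Assumes the cell holds a single whole number of minutes (an assumption
    inferred from the factor of 60 in the script). Returns None when the
    cell contains no digits, a case in which the original script would fail.
    """
    digits = "".join(ch for ch in cell_text if ch in "0123456789")
    if not digits:
        return None
    return int(digits) * 60


print(duration_to_seconds("45 min"))   # 2700
print(duration_to_seconds("n/a"))      # None
```

Because all digits are concatenated, a cell containing two numbers (for example hours and minutes) would be misread, so this only suits pages where the cell carries one number.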