
Tutorial: Using CasperJS from Python to Get JavaScript-Rendered HTML Content

Posted in Python on April 09, 2015

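Abstract: CasperJS and Python are not directly related here. The approach relies on CasperJS driving the PhantomJS WebKit engine to obtain the HTML content. Crawling HTML pages that are rendered by client-side JavaScript has long been very difficult. Java has HtmlUnit; in Python, we can use the standalone, cross-platform CasperJS.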

Create site.js (the interface file; input: url, output: HTML file)

    // USAGE: E:\toolkit\n1k0-casperjs-e3a77d0\bin>python casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile='temp.html'

    var fs = require('fs');
    var casper = require('casper').create({
        pageSettings: {
            loadImages: false,   // do not download images
            loadPlugins: false,  // do not load browser plugins
            userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36 LBBROWSER'
        },
        logLevel: "debug",  // log level
        verbose: true       // print log messages to the console
    });
    // Read the command-line options as raw strings
    var url = casper.cli.raw.get('url');
    var outputfile = casper.cli.raw.get('outputfile');
    // Request the page and write the rendered HTML to the output file
    casper.start(url, function () {
        fs.write(outputfile, this.getHTML(), 'w');
    });

    casper.run();
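
The interface described above (a URL goes in, an HTML file comes out) can be driven from Python with the standard `subprocess` module. The following is only a minimal sketch, not the author's code: the helper names `build_casperjs_command` and `render_page` are hypothetical, and it assumes that `casperjs_bin` points to a directly executable CasperJS launcher and that the output file is written as UTF-8.

```python
import io
import os
import subprocess


def build_casperjs_command(casperjs_bin, script, url, outputfile):
    """Build the argument list for one run of site.js (hypothetical helper)."""
    return [
        casperjs_bin,
        script,
        "--url=%s" % url,
        "--outputfile=%s" % outputfile,
    ]


def render_page(casperjs_bin, script, url, outputfile):
    """Run site.js and return the rendered HTML, or None if no file was written."""
    cmd = build_casperjs_command(casperjs_bin, script, url, outputfile)
    # Passing a list (no shell) avoids quoting problems with the URL.
    subprocess.call(cmd)
    if not os.path.isfile(outputfile):
        return None
    # Assumption: the file written by site.js is UTF-8 encoded.
    with io.open(outputfile, "r", encoding="utf-8") as f:
        return f.read()
```

For example, `render_page("./casperjs", "site.js", "http://spys.ru/free-proxy-list/IE/", "temp.html")` would return the rendered page as a string, provided CasperJS and PhantomJS are installed and site.js is in the working directory.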

Python code, checkout_proxy.py

    import json
    import sys
    #import requests
    #import requests.utils, pickle
    from bs4 import BeautifulSoup
    import os
    import threading
    #from multiprocessing import Process, Manager
    from datetime import datetime
    import traceback
    import logging
    import re
    import random
    import subprocess
    import shutil
    import platform

    # Directory that contains this script; proxy.txt and logs/ live next to it.
    BASE_DIR = os.path.dirname(os.path.realpath(__file__))
    output_file = os.path.join(BASE_DIR, 'proxy.txt')
    log_dir = os.path.join(BASE_DIR, 'logs')
    if not os.path.exists(log_dir):
        os.mkdir(log_dir)
    global_log = os.path.join(log_dir, 'http_proxy' + datetime.now().strftime('%Y-%m-%d') + '.log')

    logging.basicConfig(level=logging.DEBUG,format='[%(asctime)s] [%(levelname)s] [%(module)s] [%(funcName)s] [%(lineno)d] %(message)s',filename=global_log,filemode='a')
    log = logging.getLogger(__name__)
    #manager = Manager()
    #PROXY_LIST = manager.list()
    mutex = threading.Lock()  # guards PROXY_LIST
    PROXY_LIST = []
     
     
    def isWindows():
        # True when the script is running on Windows.
        return "Windows" in str(platform.uname())
     
     
    def getTagsByAttrs(tagName,pageContent,attrName,attrRegValue):
        # Return every <tagName> tag whose attribute value matches the regex.
        soup = BeautifulSoup(pageContent, 'html.parser')
        return soup.find_all(tagName, { attrName : re.compile(attrRegValue) })


    def getTagsByAttrsExt(tagName,filename,attrName,attrRegValue):
        # Same as getTagsByAttrs, but reads the HTML from a file.
        # Returns None when the file does not exist.
        if not os.path.isfile(filename):
            return None
        with open(filename,'r') as f:
            soup = BeautifulSoup(f, 'html.parser')
        return soup.find_all(tagName, { attrName : re.compile(attrRegValue) })
     
     
    class Site1Thread(threading.Thread):
        def __init__(self,outputFilePath):
            threading.Thread.__init__(self)
            self.outputFilePath = outputFilePath  # directory that contains the casperjs launcher
            self.fileName = str(random.randint(100,1000)) + ".html"
            self.setName('Site1Thread')

        def run(self):
            scriptDir = os.path.dirname(os.path.realpath(__file__))
            site1_file = os.path.join(scriptDir,'site.js')
            site2_file = os.path.join(self.outputFilePath,'site.js')
            # Copy site.js next to casperjs if it is not there yet.
            if not os.path.isfile(site2_file) and os.path.isfile(site1_file):
                shutil.copy(site1_file,site2_file)
            if isWindows():
                proc = subprocess.Popen(["cmd","/c", "%s/casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile=%s" % (self.outputFilePath,self.fileName) ],stdout=subprocess.PIPE)
            else:
                proc = subprocess.Popen(["bash","-c", "cd %s && ./casperjs site.js --url=http://spys.ru/free-proxy-list/IE/ --outputfile=%s" % (self.outputFilePath,self.fileName) ],stdout=subprocess.PIPE)
            out = proc.communicate()[0]
            htmlFileName = ''
            # The output location is not predictable on Windows, so check every candidate path.
            if os.path.isfile(self.fileName):
                htmlFileName = self.fileName
            elif os.path.isfile(os.path.join(self.outputFilePath,self.fileName)):
                htmlFileName = os.path.join(self.outputFilePath,self.fileName)
            elif os.path.isfile(os.path.join(scriptDir,self.fileName)):
                htmlFileName = os.path.join(scriptDir,self.fileName)
            if not os.path.isfile(htmlFileName):
                print('Failed to get html content from http://spys.ru/free-proxy-list/IE/')
                print(out)
                sys.exit(3)  # inside a worker thread this only ends the thread
            # Hold the lock while appending to the shared list.
            with mutex:
                PROXYList = getTagsByAttrsExt('font',htmlFileName,'class','spy14$')
                for proxy in PROXYList:
                    tdContent = proxy.decode_contents()
                    lineElems = re.split('[<>]',tdContent)
                    # The first piece should look like an IP address, the last like a port.
                    if re.search(r'\d+',lineElems[-1]) and re.search(r'(\d+\.\d+\.\d+)',lineElems[0]):
                        print("%s %s" % (lineElems[0],lineElems[-1]))
                        PROXY_LIST.append("%s:%s" % (lineElems[0],lineElems[-1]))
            # Remove the temporary HTML file; ignore failures.
            try:
                if os.path.isfile(htmlFileName):
                    os.remove(htmlFileName)
            except OSError:
                pass
     
    if __name__ == '__main__':
        try:
            if len(sys.argv) < 2:
                print("Usage: %s [casperjs path]" % sys.argv[0])
                sys.exit(1)
            if not os.path.exists(sys.argv[1]):
                print("casperjs path: %s does not exist!" % sys.argv[1])
                sys.exit(2)
            # Load previously saved proxies so that new results are merged with them.
            if os.path.isfile(output_file):
                with open(output_file) as f:
                    for line in f:
                        PROXY_LIST.append(line.strip())
            thread1 = Site1Thread(sys.argv[1])
            thread1.start()
            thread1.join()

            # Write the de-duplicated list back to proxy.txt.
            with open(output_file,'w') as f:
                for proxy in set(PROXY_LIST):
                    f.write(proxy+"\n")
            print("Done!")
        # SystemExit is not a subclass of Exception, so the exit codes above propagate.
        except Exception:
            errMsg = traceback.format_exc()
            print(errMsg)
            log.error(errMsg)
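
The parsing step in `run()` splits the inner HTML of each matching `font` tag on `<` and `>` and treats the first piece as the IP address and the last piece as the port. The sketch below isolates that idea using only the standard library. The function name `parse_proxy` and the sample markup are hypothetical (the real markup of the target page may differ), and the checks are slightly stricter than the ones in the script because they match the whole piece.

```python
import re

# Hypothetical patterns: a dotted IPv4-like address and a numeric port.
IP_RE = re.compile(r'^\d+\.\d+\.\d+\.\d+$')
PORT_RE = re.compile(r'^\d+$')


def parse_proxy(cell_html):
    """Return 'ip:port' for one table cell, or None if it does not look like a proxy."""
    # Splitting on '<' and '>' leaves the text before the first tag
    # and the text after the last tag at the two ends of the list.
    parts = re.split(r'[<>]', cell_html)
    host = parts[0].strip()
    port = parts[-1].strip()
    if IP_RE.match(host) and PORT_RE.match(port):
        return "%s:%s" % (host, port)
    return None


# Made-up sample cell contents, for illustration only.
sample_cells = [
    '203.0.113.7<font class="spy2">:</font>8080',
    'Proxy address:port',
]
proxies = [p for p in map(parse_proxy, sample_cells) if p is not None]
```

Cells that do not start with an address and end with a port, such as a table header, are skipped, which is the same filtering role the two regular expressions play in the script.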


- Author -

Ihavegotyou

