Building a Newspaper-Reading Program in Python by Parsing Web Pages


Posted in Python on August 04, 2014

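The example in this article uses Python to read the image edition of the newspaper Cankao Xiaoxi (《参考消息》): it finds the current day's issue and automatically downloads its page images to a local folder for viewing. The code targets Python 2 (it uses urllib2 and print statements). The full implementation is as follows: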

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Socket timeout in seconds, applied to every urllib2 request
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
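try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.HTTPError, e:
  # The server answered with an HTTP error status
  print e.code
  exit()
except urllib2.URLError, e:
  # The server could not be reached; URLError has a reason but no code
  print e.reason
  exit()
except Exception, e:
  # Anything else, e.g. a socket timeout while reading
  print e
  exit()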

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

if not os.path.exists(image_folder):
  os.makedirs(image_folder)
# One sub-folder per day, named after the local date
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.URLError, e:
  # URLError also covers HTTPError, so network and HTTP failures both land here
  print e
  exit()

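# The pager contains text like "共N页" ("N pages in total").
# This assumes the page is gbk-encoded like this source file, so "共"
# occupies two bytes, hence the offset of 2 below.
tot_page_index = first_page.find("共")
if tot_page_index == -1:
  print "Cannot find the page count!"
  exit()

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

page_num = tmp_str[2:end_s]
print page_num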

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find the image for news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url

  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if the download fails midway
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
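
For readers on Python 3, the parsing steps of the script above (reading the page count, building each page's URL, and locating the image link) can be sketched as pure functions that are testable without network access. This is a minimal sketch and not the original author's code: the function names `parse_page_count`, `build_page_urls`, and `find_image_url` are hypothetical, the HTML is assumed to be already decoded to text, and the sample HTML and `example.com` URLs are made up for illustration.

```python
import re


def parse_page_count(html):
    """Return N from pager text like "共N页" ("N pages in total"), or None."""
    match = re.search(r"共\s*(\d+)\s*页", html)
    return int(match.group(1)) if match else None


def build_page_urls(first_page_url, page_count):
    """Page 1 is base.html; page k (k > 1) is base_k.html, as in the script."""
    base = first_page_url[:-len(".html")]
    urls = [base + ".html"]
    for index in range(2, page_count + 1):
        urls.append("%s_%d.html" % (base, index))
    return urls


def find_image_url(html):
    """Return the first match of the script's image pattern, or None."""
    match = re.search(r"http://image\S+jpg", html)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real site
    sample = '<p>共3页</p> <img src="http://image.example.com/a/b.jpg" />'
    print(parse_page_count(sample))
    print(build_page_urls("http://www.example.com/arc/news.html", 3))
    print(find_image_url(sample))
```

Keeping the parsing separate from the downloading makes it easier to adjust when the site's markup changes, because each function can be checked against a saved copy of the page.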