Building a Newspaper-Reading Program in Python by Parsing Web Pages


Posted in Python on August 04, 2014

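The example in this article uses Python to view the image edition of the newspaper 《参考消息》 (Cankao Xiaoxi): it finds the current day's issue on the site and automatically downloads its page images to a local folder for reading. The implementation is as follows: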

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Timeout in seconds, applied to every socket opened by urllib2
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.URLError, e:
  # A plain URLError has no .code attribute (only HTTPError does), so report the reason
  print e.reason
  exit()
except Exception, e:
  print e
  exit()

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

if not os.path.exists(image_folder):
  os.makedirs(image_folder)
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
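except urllib2.HTTPError, e:
  print e.code
  exit()
except urllib2.URLError, e:
  # Connection-level failures (DNS, timeout) carry a reason instead of a status code
  print e.reason
  exit()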

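# The page states its length as "共N页" ("N pages in total"); locate that marker
tot_page_index = first_page.find("共")

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

# With the GBK encoding declared above, "共" occupies two bytes, so the digits start at offset 2
page_num = tmp_str[2:end_s]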
print page_num

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find image on news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url
  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if a read or write fails
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
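
The script above finds the article link, the page count, and the per-page URLs using string searches and byte offsets. These steps can be sketched more robustly with regular-expression capture groups. The block below is a minimal Python 3 sketch, not the original author's code. The helper names (`parse_page_count`, `extract_article_path`, `build_page_urls`) are hypothetical, and it assumes the page has already been decoded to text (for example with `raw.decode("gbk", errors="replace")`) and that the site still uses the "共N页" pager and the `/arc/jwbt/ckxx/...` link pattern seen in the script.

```python
import re

# Assumption: the article page contains text such as "共8页" ("8 pages in total").
PAGE_COUNT_RE = re.compile(r"共\s*(\d+)\s*页")

# Assumption: article links follow the path pattern used in the script above.
ARTICLE_PATH_RE = re.compile(r'href="(/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html)"')


def parse_page_count(html):
    """Return the page count stated in the page text, or 1 if no pager is found."""
    match = PAGE_COUNT_RE.search(html)
    return int(match.group(1)) if match else 1


def extract_article_path(html):
    """Return the first matching article path, or None if there is no match."""
    match = ARTICLE_PATH_RE.search(html)
    return match.group(1) if match else None


def build_page_urls(first_page_url, page_count):
    """Page 1 is 'name.html'; page n > 1 is 'name_n.html', as in the script."""
    base = first_page_url[:-len(".html")]
    urls = [first_page_url]
    for index in range(2, page_count + 1):
        urls.append("{}_{}.html".format(base, index))
    return urls
```

Because these helpers only work on strings, they can be checked against saved copies of the pages without touching the network. Falling back to a page count of 1 when no pager is found is a design choice of this sketch. The original script would instead stop with an error at `int(page_num)`.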