Simulating Browser Behavior in Python with the mechanize Module

Posted in Python on May 05, 2015

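Knowing how to quickly instantiate a browser from the command line or inside a Python script is often very useful.
Whenever I need to automate any web-related task, I use this Python code to emulate a browser.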

import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

You now have a browser instance, the br object. With it you can open a page, using code like the following:

# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://google.com')
html = r.read()
# Show the source
print html
# or
print br.response().read()
# Show the html title
print br.title()
# Show the response headers
print r.info()
# or
print br.response().info()
# Show the available forms
for f in br.forms():
  print f
# Select the first (index zero) form
br.select_form(nr=0)
# Let's search
br.form['q']='weekend codes'
br.submit()
print br.response().read()
# Looking at some results in link format
for l in br.links(url_regex='stockrt'):
  print l
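
To make the url_regex filter above a little more concrete, here is a minimal sketch that uses only the Python 3 standard library, not mechanize itself. It assumes the filter behaves like a regular-expression search over each link's URL. LinkCollector, find_links and SAMPLE_HTML are hypothetical names made up for this illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_links(html, url_regex):
    """Return the link URLs in html that match url_regex (search, not full match)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    pattern = re.compile(url_regex)
    return [url for url in parser.links if pattern.search(url)]


# Hypothetical page content, only for the demonstration.
SAMPLE_HTML = """
<a href="http://stockrt.github.com/">Weekend codes</a>
<a href="http://example.com/other">Something else</a>
"""

for url in find_links(SAMPLE_HTML, "stockrt"):
    print(url)
```

mechanize returns richer link objects (with text and attributes), so treat this only as a picture of the filtering step.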

If the site you are visiting requires authentication (HTTP basic auth), then:

# If the protected site didn't receive the authentication data you would
# end up with a 401 error in your face
br.add_password('http://safe-site.domain', 'username', 'password')
br.open('http://safe-site.domain')
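
For the curious, this is roughly what the authentication data looks like on the wire. With HTTP basic auth the client sends an Authorization header containing the base64 encoding of "username:password". The sketch below (Python 3, standard library only) builds that header value; basic_auth_header is a hypothetical helper name, and mechanize does the equivalent for you once add_password has been called.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value used by HTTP basic auth."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


header = basic_auth_header("username", "password")
print(header)
```

Note that base64 is an encoding, not encryption, so basic auth should only be sent over HTTPS.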

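Thanks to the Cookie Jar added earlier, you do not need to manage login sessions for the site yourself, such as when you have to POST a username and password.
In that case the site usually asks your browser to store a session cookie, and expects your browser to send that same cookie back on later visits.
All of this, storing and re-sending the session cookie, is handled by the Cookie Jar. Neat, huh?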
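
Here is a small offline sketch of that store-and-resend cycle, written for Python 3, where cookielib is named http.cookiejar. No network is involved: FakeResponse is a hypothetical stand-in for a server reply, and the URLs and the sessionid cookie are made up for the example.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Minimal stand-in for an HTTP response: only info() is needed by CookieJar."""

    def __init__(self, set_cookie_value):
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie_value

    def info(self):
        return self._headers


def simulate_session():
    jar = CookieJar()

    # Step 1: the "server" answers the login request with a session cookie.
    login_request = Request("http://www.example.com/login")
    login_response = FakeResponse("sessionid=abc123; Path=/")
    jar.extract_cookies(login_response, login_request)

    # Step 2: the jar attaches the stored cookie to a later request.
    next_request = Request("http://www.example.com/profile")
    jar.add_cookie_header(next_request)
    return next_request.get_header("Cookie")


print(simulate_session())
```

In real use you never call extract_cookies or add_cookie_header yourself; the browser object (or an opener with a cookie processor) calls them on every request.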
You can also manage your browsing history:

# Testing presence of link (if the link is not found you would have to
# handle a LinkNotFoundError exception)
br.find_link(text='Weekend codes')
# Actually clicking the link
req = br.click_link(text='Weekend codes')
br.open(req)
print br.response().read()
print br.geturl()
# Back
br.back()
print br.response().read()
print br.geturl()

Downloading a file:

# Download
f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
print f
fh = open(f)
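
The [0] above suggests that retrieve returns a tuple whose first element is the local filename. The standard library's urlretrieve has the same shape, so here is a hedged Python 3 sketch of the pattern using it instead of mechanize. To keep it runnable offline it "downloads" from a local file:// URL; download and demo are hypothetical names.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download(url, destination):
    """Fetch url into destination and return the local filename."""
    filename, _headers = urllib.request.urlretrieve(url, destination)
    return filename


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Create a local file and address it with a file:// URL.
        source = Path(tmp) / "source.txt"
        source.write_text("hello from a local file\n", encoding="utf-8")

        target = os.path.join(tmp, "copy.txt")
        saved = download(source.as_uri(), target)

        # Use a context manager so the file handle is always closed.
        with open(saved, encoding="utf-8") as fh:
            return fh.read()


print(demo())
```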

Setting a proxy for HTTP:

# Proxy and user/password
br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
# Proxy
br.set_proxies({"http": "myproxy.example.com:3128"})
# Proxy password
br.add_proxy_password("joe", "password")
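
If you are not using mechanize, a similar proxy setup can be sketched with the Python 3 standard library. This only builds the opener and sends nothing; the proxy host and credentials are the same placeholders as above, and build_proxy_opener is a hypothetical helper name.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes plain-HTTP requests through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_handler)


opener = build_proxy_opener("http://joe:password@myproxy.example.com:3128")

# Inspect the opener to confirm the proxy mapping is in place.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(proxy_handlers[0].proxies)
```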

However, if you just want to open a web page and do not need any of the fancy features above, you can do this:

# Simple open?
import urllib2
print urllib2.urlopen('http://stockrt.github.com').read()
# With password?
import urllib
opener = urllib.FancyURLopener()
print opener.open('http://user:password@stockrt.github.com').read()

You can learn more from the official mechanize website, the mechanize documentation, and the ClientForm documentation.

Original source: http://reyoung.me/index.php/2012/08/08/%E7%BF%BB%E8%AF%91%E4%BD%BF%E7%94%A8python%E6%A8%A1%E4%BB%BF%E6%B5%8F%E8%A7%88%E5%99%A8%E8%A1%8C%E4%B8%BA/

——————————————————————————————

Finally, a few words on a very important concept and technique when accessing pages from code: cookies.

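As we all know, HTTP is a stateless protocol, yet the client and the server still need to keep some shared information, such as cookies. Only with a cookie can the server know that this user has just logged in, and grant the client access to certain pages.
For example, when using Sina Weibo in a browser you must log in first; only after a successful login can you open the other pages. When logging in to Sina Weibo or any other authenticated site from a program, the key is likewise to save the cookie and then send it along with later requests.
This requires Python's cookielib and urllib2 to work together: by binding cookielib to urllib2, cookies are attached whenever a page is requested.
In practice, the first step is to use the HttpFox add-on for Firefox: open the Sina Weibo home page in the browser, log in, and check in the HttpFox log which data was sent to which URL at each step. Then reproduce that process in Python: use urllib2.urlopen to send the username and password to the login page, obtain the post-login cookie, and then visit other pages to fetch Weibo data.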

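The main purpose of the cookielib module is to provide objects that can store cookies, so that it can be used together with urllib2 to access Internet resources. For example, a CookieJar object can capture cookies and re-send them on subsequent requests. The main classes in cookielib are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
The urllib2 module is similar to urllib: it opens URLs and fetches data from them. Unlike urllib, urllib2 can not only use urlopen() but also define a custom opener to access pages. Note also that urlretrieve() belongs to urllib; urllib2 has no such function. Even so, urllib2 is rarely used without urllib, because POST data has to be encoded with urllib.urlencode().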
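
The file-backed cookie jar classes listed above can persist cookies between runs. The following is a minimal sketch in Python 3, where the module is named http.cookiejar. The cookie is built by hand so that nothing touches the network; make_session_cookie, save_and_reload and demo are hypothetical names, and the cookie values are made up.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (normally the jar does this from a response)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(path):
    jar = LWPCookieJar(path)
    jar.set_cookie(make_session_cookie("sessionid", "abc123", "www.example.com"))
    # Session cookies are marked "discard", so ask explicitly to keep them.
    jar.save(ignore_discard=True)

    reloaded = LWPCookieJar(path)
    reloaded.load(ignore_discard=True)
    return [(c.name, c.value, c.domain) for c in reloaded]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        return save_and_reload(os.path.join(tmp, "cookies.txt"))


print(demo())
```

Without ignore_discard=True both save and load skip session cookies, which is an easy way to end up with a cookie file that looks empty.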

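cookielib is generally used together with urllib2, mainly by passing a cookie jar to urllib2.HTTPCookieProcessor(), whose result is then passed to urllib2.build_opener(). The following code, which logs in to Renren, shows the usage: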

#! /usr/bin/env python
#coding=utf-8
import urllib2
import urllib
import cookielib
data={"email":"username","password":"password"} # login username and password
post_data=urllib.urlencode(data)
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
headers ={"User-agent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
req=urllib2.Request("http://www.renren.com/PLogin.do",post_data,headers)
content=opener.open(req)
print content.read().decode("utf-8").encode("gbk")
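
The listing above targets Python 2. In Python 3 the same modules live under different names: cookielib became http.cookiejar, urllib2 became urllib.request, and urllib.urlencode moved to urllib.parse.urlencode. The sketch below shows the same pattern under those names. It only builds the opener and the request and does not contact any site; the login URL, form field names and credentials are placeholders, and build_cookie_opener and build_login_request are hypothetical helper names.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and re-sends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, email, password):
    """Build a POST request carrying url-encoded login form data."""
    form = {"email": email, "password": password}
    # In Python 3 the POST body must be bytes, not str.
    post_data = urllib.parse.urlencode(form).encode("utf-8")
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-client)"}
    return urllib.request.Request(login_url, data=post_data, headers=headers)


opener, jar = build_cookie_opener()
request = build_login_request(
    "http://www.example.com/login", "user@example.com", "secret"
)
print(request.get_method())
print(request.data)
# To actually log in, one would call opener.open(request) and then reuse the
# same opener for later pages so that the session cookie is sent back.
```

Passing data to Request is what turns it into a POST; reusing one opener for every later request is what keeps the session alive.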

For details, see:

http://www.crazyant.net/796.html Simulating a Sina Weibo login and scraping data with Python's cookielib and urllib2

http://my.oschina.net/duhaizhang/blog/69342 The urllib2 module

https://docs.python.org/2/library/cookielib.html cookielib — Cookie handling for HTTP clients

- Author -

xrzs
