Distributed deployment of pyspider on CentOS 7


Posted in Python on May 03, 2017

1. Setting up the environment:

OS version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Python version: Python 3.5.1

1.1. Setting up the Python 3 environment:

After trying both approaches below, I settled on the Anaconda distribution.

1.1.1. Building from source

# Install the build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
# Download the Python source
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# Or use a mirror in China
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# Unpack
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# Configure, build, and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# Create a symlink
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
# Register the shared library directory (needed because of --enable-shared)
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# Verify python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# Upgrade pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# I ran into problems at this step and reinstalled pip
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python3 get-pip.py
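
When Python is built from source, an optional standard-library module whose -devel package is missing is skipped with only a warning, so the build can succeed and still be incomplete. The following is a minimal sketch of a post-build check; the module list is my own selection, not something the build defines.

```python
import importlib

# Modules that depend on the -devel packages installed above.
# This list is an assumption chosen for illustration; extend it as needed.
OPTIONAL_MODULES = ["ssl", "zlib", "sqlite3", "readline", "bz2", "lzma", "ctypes"]


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(OPTIONAL_MODULES)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

Run it with the freshly built interpreter, for example with /usr/bin/python3. If a module is reported missing, install the matching -devel package and rebuild.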

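1.1.2. The Anaconda distribution

# Anaconda distribution (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# Run the installer (the downloaded file is not executable, so call it through bash)
bash ./Anaconda3-4.2.0-Linux-x86_64.sh
# If it fails, extraction may have failed because bzip2 is missing
yum install bzip2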

1.2. Installing MariaDB

# Install
yum -y install mariadb mariadb-server
# Start
systemctl start mariadb
# Start at boot
systemctl enable mariadb
# Set the root password (empty by default)
mysql_secure_installation
# Log in
mysql -u root -p
# Create a user; choose your own user name and password
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;

1.3. Installing pyspider

I use Anaconda.

# Create a virtual environment named sbird with Python 3.*
conda create -n sbird "python=3*"
# Activate the environment
source activate sbird
# Install pyspider
pip install pyspider
# If you get the error
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
# run the following (it can also go into .bashrc)
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
# If you get the error
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
# install pycurl from conda
conda install pycurl
# Deactivate the environment
source deactivate sbird
# Inside a virtual machine, if localhost:5000 is unreachable, you can stop the firewall
systemctl stop firewalld.service
######### Running directly from source ==============
mkdir git;cd git
# Download
git clone https://github.com/binux/pyspider.git
# Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py

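Alternative method

# Set up a virtual environment
pip install virtualenv
mkdir python;cd python
# Create the virtual environment pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# Enter the directory and activate the environment
cd pyenv3/
source ./bin/activate
pip install pyspider
# If pycurl fails to build
yum install libcurl-devel
# Then retry
pip install pyspider
# Deactivate
deactivate

I recommend the Anaconda route.

If pyspider reports errors while running, refer to the Anaconda installation section above. At this point, with pyspider running, the web page is available at localhost:5000.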
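
To confirm from a script that the web page answers, rather than opening a browser inside the VM, something like the following sketch can be used. The function name webui_is_up is my own, and the check only looks for an HTTP 200 reply, so a web UI with authentication enabled would be reported as down.

```python
import http.client
import urllib.request

# Ignore any http_proxy environment variable: the target is a local address.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def webui_is_up(url, timeout=3.0):
    """Return True if an HTTP GET on `url` is answered with status 200."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False
```

For the single-server setup above, the call would be webui_is_up("http://localhost:5000/"). A False result with pyspider running usually points at the firewall or the bind address.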

1.4. Installing Supervisor

# Install
yum install supervisor -y
# If the package cannot be found, add Aliyun's EPEL repository
vim /etc/yum.repos.d/epel.repo
# Add the following content (continuation lines of baseurl must be indented)
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
        http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
        http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
        http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
# Install again
yum install supervisor -y
# Check that the installation succeeded
echo_supervisord_conf

1.4.1. Using Supervisor

supervisord   # starts the Supervisor server (daemon)
supervisorctl  # opens the Supervisor command-line client
# Suppose we create a process named pyspider01
vim /etc/supervisord.d/pyspider01.ini
# Write the following content
[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
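# The log directory must exist before the program starts
mkdir -p /pyspider/supervisor
# Reload
supervisorctl reload
# Start
supervisorctl start pyspider01
# supervisord can also be started this way
supervisord -c /etc/supervisord.conf
# Check status
supervisorctl status
# output
pyspider01            RUNNING  pid 4026, uptime 0:02:40
# Shut down supervisord together with the processes it manages
supervisorctl shutdown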

1.5. Installing Redis

# Redis serves as the message queue
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# Or install it directly with yum
yum -y install redis
# Start
systemctl start redis.service
# Restart
systemctl restart redis.service
# Stop
systemctl stop redis.service
# Check status
systemctl status redis.service
# Edit /etc/redis.conf
vim /etc/redis.conf
# Changes to make
change "daemonize no" to "daemonize yes"
change "bind 127.0.0.1" to "bind 10.211.55.22" (this server's IP)
# Restart redis
systemctl restart redis.service
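
After changing the bind address it is worth checking that Redis answers on the new IP before pointing the other servers at it. The sketch below sends a plain inline PING over a TCP socket, so it needs no Redis client library; the function name redis_ping is my own, and it assumes no password is configured.

```python
import socket


def redis_ping(host, port=6379, timeout=3.0):
    """Send an inline PING and return True if the reply starts with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False
    return reply.startswith(b"+PONG")
```

From another machine, redis_ping("10.211.55.22") should return True once the bind change and restart have taken effect. An error reply (for example when a password is required) counts as False here.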

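1.6. Starting services at boot

# Enable Supervisor at boot
systemctl enable supervisord.service
# Enable redis at boot
systemctl enable redis.service
# Keep the firewall from starting at boot
systemctl disable firewalld.service

This completes the single-server pyspider environment and deployment; open localhost:5000 to reach the web UI.

You can also write scripts and run them, checking /pyspider/supervisor/pyspider01.log for the run status.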

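2. Distributed deployment

Call the server configured above centos01, then set up two more servers, centos02 and centos03, in the same way.

The layout is as follows:

Server name   IP             Components

centos01      10.211.55.22   redis, mariaDB, scheduler
centos02      10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03      10.211.55.24   fetcher, processor, result_worker, webui

2.1. centos01

Log in to centos01. The basic environment is already in place from step 1, so start by editing the configuration file /pyspider/config.json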

{
 "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
 "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
 "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
 "message_queue": "redis://10.211.55.22:6379/db",
 "logging-config": "/pyspider/logging.conf",
 "phantomjs-proxy":"10.211.55.23:25555",
 "webui": {
  "username": "",
  "password": "",
  "need-auth": false,
  "host":"10.211.55.24",
  "port":"5000",
  "scheduler-rpc":"http://10.211.55.22:5002",
  "fetcher-rpc":"http://10.211.55.23:5001"
 },
 "fetcher": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5001"
 },
 "scheduler": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5002"
 }
}
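
A stray space or a missing host in one of these URLs is easy to overlook and only shows up later as a connection error. The following sketch loads a config file and lists URL values that look malformed. It is a generic sanity check of my own, not something pyspider provides, and it treats any string containing "://" as a URL.

```python
import json
from urllib.parse import urlsplit


def find_bad_urls(node, path=""):
    """Recursively collect (key_path, value) pairs whose URL looks malformed.

    A value is reported when it contains whitespace or when no host can be
    parsed out of it. Only nested dicts are walked, which matches this config.
    """
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = path + "." + key if path else key
            bad.extend(find_bad_urls(value, child))
    elif isinstance(node, str) and "://" in node:
        has_space = any(ch.isspace() for ch in node)
        try:
            host = urlsplit(node).hostname
        except ValueError:
            host = None
        if has_space or not host:
            bad.append((path, node))
    return bad


def check_config_file(filename):
    """Load a JSON config file and return its malformed URL entries."""
    with open(filename, encoding="utf-8") as handle:
        return find_bad_urls(json.load(handle))
```

Calling check_config_file("/pyspider/config.json") returns an empty list when every URL parses cleanly. Because json.load is used, a syntax error in the file also surfaces here rather than at pyspider startup.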

Try running it:

/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Error
ImportError: No module named 'mysql'
# Download mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# Install it into the sbird environment
source activate sbird
cd mysql-connector-python
python setup.py install
# Install the redis client library for Python
pip install redis
source deactivate
# Run again
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Output: OK
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0

Once it runs successfully, edit /etc/supervisord.d/pyspider01.ini as follows:

[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Check status
supervisorctl status

centos01 is now deployed.

2.2. centos02

centos02 needs to run result_worker, processor, phantomjs, and fetcher.

Create the following files:

/etc/supervisord.d/result_worker.ini

[program:result_worker]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

[program:processor]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

[program:phantomjs]

command   = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

[program:fetcher]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/fetcher.log
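
These program files differ only in the program name, the command, and the log file, so they can be generated instead of edited by hand. The sketch below renders the same settings from one template. The paths are the ones used in this article, and the helper names render_program and write_programs are hypothetical.

```python
import os

# Paths taken from this article; adjust them to your own layout.
PYTHON = "/root/anaconda3/envs/sbird/bin/python"
RUN_PY = "/root/git/pyspider/run.py"
CONFIG = "/pyspider/config.json"

TEMPLATE = """[program:{name}]
command = {command}
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/{name}.log
"""


def render_program(name, command=None):
    """Return the ini text for one component; `command` overrides the default."""
    if command is None:
        command = "{} {} -c {} {}".format(PYTHON, RUN_PY, CONFIG, name)
    return TEMPLATE.format(name=name, command=command)


def write_programs(names, target_dir):
    """Write <name>.ini for each component into target_dir; return the paths."""
    paths = []
    for name in names:
        path = os.path.join(target_dir, name + ".ini")
        with open(path, "w") as handle:
            handle.write(render_program(name))
        paths.append(path)
    return paths
```

For example, write_programs(["result_worker", "processor", "fetcher"], "/etc/supervisord.d") covers the three pyspider components. phantomjs uses a different command line, so it would be rendered with render_program("phantomjs", command=...) and written separately.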

Create pjsconfig.json in the /pyspider directory

{
 /* Same as: --ignore-ssl-errors=true */
 "ignoreSslErrors": true,

 /* Same as: --ssl-protocol=any */
 "sslProtocol": "any",

 /* Same as: --output-encoding=utf8 */
 "outputEncoding": "utf8",

 /* Persistent cookies */
 /* "cookiesFile": "e:/phontjscookies.txt", */
 "cookiesFile": "pyspider/phontjscookies.txt",

 /* Do not load images */
 "autoLoadImages": false
}
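
Hand-written JSON is easy to get subtly wrong, for instance with an unquoted key or `=` in place of `:`. One way around this is to build the settings as a Python dict and let the json module serialize them. This is only a sketch: it assumes the key names are the camelCase forms of the PhantomJS command-line options, and the output contains no comments.

```python
import json


def build_pjsconfig(cookies_file="pyspider/phontjscookies.txt"):
    """Return the PhantomJS settings from this article as a dict.

    The key names are assumed to be the camelCase forms of the matching
    command-line options.
    """
    return {
        "ignoreSslErrors": True,
        "sslProtocol": "any",
        "outputEncoding": "utf8",
        "cookiesFile": cookies_file,
        "autoLoadImages": False,
    }


def dump_pjsconfig(settings):
    """Serialize the settings as indented, strictly valid JSON text."""
    return json.dumps(settings, indent=1, sort_keys=True)


if __name__ == "__main__":
    print(dump_pjsconfig(build_pjsconfig()))
```

Redirecting the printed output into /pyspider/pjsconfig.json gives a file that any JSON parser accepts.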

Download phantomjs into the /pyspider/ folder and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js

# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 3446, uptime 0:00:07
phantomjs            RUNNING  pid 3448, uptime 0:00:07
processor            RUNNING  pid 3447, uptime 0:00:07
result_worker          RUNNING  pid 3445, uptime 0:00:07

centos02 is now deployed.

2.3. centos03

The three processes fetcher, processor, and result_worker are deployed exactly as on centos02; on top of that, this server adds webui.

Create the file:

/etc/supervisord.d/webui.ini

[program:webui]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/webui.log
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 2724, uptime 0:00:07
processor            RUNNING  pid 2725, uptime 0:00:07
result_worker          RUNNING  pid 2723, uptime 0:00:07
webui              RUNNING  pid 2726, uptime 0:00:07

3. Summary

Open http://10.211.55.24:5000 and crawl away.
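
If the page does not load, a quick way to narrow the problem down is to test each service port of the cluster from one machine. The sketch below uses the hosts and ports from the configuration in this article and only checks that a TCP connection can be opened, not that the service behind it is healthy. The function names are my own.

```python
import socket

# (label, host, port) taken from the layout and config.json in this article.
ENDPOINTS = [
    ("mariadb", "10.211.55.22", 3306),
    ("redis", "10.211.55.22", 6379),
    ("scheduler xmlrpc", "10.211.55.22", 5002),
    ("fetcher xmlrpc", "10.211.55.23", 5001),
    ("phantomjs", "10.211.55.23", 25555),
    ("webui", "10.211.55.24", 5000),
]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_endpoints(endpoints, timeout=2.0):
    """Return the labels of all endpoints that refuse or time out."""
    return [label for label, host, port in endpoints
            if not port_open(host, port, timeout)]
```

closed_endpoints(ENDPOINTS) returns an empty list when every port is reachable. Any label it does return points at the service, firewall rule, or bind address to check first.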
