Scrapy Framework: Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings in settings.py, with examples for reference.

Scrapy Framework: Basic Commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped data to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv

4. check: run the spiders' contract checks to find errors

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in a browser, as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open the interactive Scrapy shell for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider contained in a single Python file, without a project

scrapy runspider zufang_spider.py
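
The crawl command from step 3 can also be launched from a Python script. The following is a minimal sketch, not a definitive approach: the helper name crawl_command and the spider name "zufang" are assumptions made for illustration (the article only gives the output file names), and the helper only builds the argument list that would be passed to the scrapy executable.

```python
import shlex


def crawl_command(spider_name, output=None):
    """Build the argument list for `scrapy crawl`, optionally exporting items with -o."""
    args = ["scrapy", "crawl", spider_name]
    if output:
        # -o writes the scraped items to the given file (for example .json or .csv)
        args += ["-o", output]
    return args


if __name__ == "__main__":
    # "zufang" is an assumed spider name; zufang.json is the output file used above.
    cmd = crawl_command("zufang", output="zufang.json")
    print(shlex.join(cmd))
    # To actually run it, execute this from inside the Scrapy project directory:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a spider name or file path contains spaces.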

Scrapy Framework: settings.py Configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single User-Agent string can be set here, not a list of them
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# The generated project template obeys robots.txt (True); it is switched off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE = "maitian.log"
# There are five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL="INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
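# Download delay in seconds: with a value of 3, Scrapy waits about 3 seconds
# between consecutive requests to the same website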
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default, i.e. COOKIES_ENABLED = True)
#COOKIES_ENABLED = False
# Disable the Telnet console used for remote inspection (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines (they are enabled here, in the settings file)
# Order values conventionally range from 0 to 1000; pipelines with lower values run first
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
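
Two of the comments in the listing describe ordering rules: a higher LOG_LEVEL produces less output, and pipelines with lower values in ITEM_PIPELINES run first. The sketch below illustrates both rules with the standard library only. It does not call Scrapy, the helper names pipeline_order and visible_levels are made up for this example, and the CleanPipeline and SavePipeline entries are hypothetical additions to the MaitianPipeline entry shown above.

```python
import logging

# Hypothetical pipeline paths and order values, used only for illustration.
ITEM_PIPELINES = {
    "maitian.pipelines.SavePipeline": 800,
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.CleanPipeline": 100,
}

# The five level names listed in the settings comments, from lowest to highest.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def pipeline_order(pipelines):
    """Return pipeline paths sorted by order value, lowest (runs first) to highest."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


def visible_levels(log_level):
    """Return the level names still emitted when LOG_LEVEL is set to log_level."""
    threshold = getattr(logging, log_level)
    return [name for name in LEVELS if getattr(logging, name) >= threshold]


if __name__ == "__main__":
    print(pipeline_order(ITEM_PIPELINES))
    print(visible_levels("INFO"))
```

The level comparison relies on the numeric values that Python's logging module assigns to each level name, which is why raising the threshold from DEBUG to WARNING drops the DEBUG and INFO messages.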


I hope this article helps with writing Python programs based on the Scrapy framework.
