Python Big Data: How to Scrape Data from Web Pages, Explained in Detail

Posted in Python on November 16, 2019

This article walks through, with a complete working example, how to scrape data from a web page in Python using Scrapy. It is shared for your reference; the details are as follows.
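All of the code below belongs to a Scrapy project named jredu (the import from jredu.items import JreduItem assumes that name), presumably generated beforehand with `scrapy startproject jredu`; the spider file myspider.py then lives in the jredu/spiders/ package alongside the generated items.py, middlewares.py, pipelines.py, and settings.py.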

myspider.py:

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider
from lxml import etree
from jredu.items import JreduItem

class JreduSpider(Spider):
    name = 'tt'  # the spider's name; required and must be unique
    allowed_domains = ['sohu.com']
    start_urls = [
        'http://www.sohu.com'
    ]

    def parse(self, response):
        content = response.body.decode('utf-8')
        dom = etree.HTML(content)
        for ul in dom.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
            lis = ul.xpath("./li")
            for li in lis:
                item = JreduItem()  # create an item object
                if ul.index(li) == 0:
                    # the first <li> carries its headline inside a <strong> tag
                    strong = li.xpath("./a/strong/text()")
                    item['title'] = strong[0]
                    item['href'] = li.xpath("./a/@href")[0]
                else:
                    la = li.xpath("./a[last()]/text()")
                    item['title'] = la[0]
                    item['href'] = li.xpath("./a[last()]/@href")[0]  # note the @: we want the attribute
                yield item
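As an aside, the same extraction can be written with Scrapy's built-in selectors instead of importing lxml. A minimal sketch of an equivalent parse method, assuming the same (now dated) sohu.com front-page markup, with the XPath expressions unchanged:

    def parse(self, response):
        # response.xpath() returns SelectorList objects; extract_first()
        # yields the first match (or None), so no manual decoding is needed
        for ul in response.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
            for i, li in enumerate(ul.xpath("./li")):
                item = JreduItem()
                if i == 0:
                    item['title'] = li.xpath("./a/strong/text()").extract_first()
                    item['href'] = li.xpath("./a/@href").extract_first()
                else:
                    item['title'] = li.xpath("./a[last()]/text()").extract_first()
                    item['href'] = li.xpath("./a[last()]/@href").extract_first()
                yield item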

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class JreduItem(scrapy.Item):  # roughly the counterpart of an entity class in Java
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # declare a Field for each piece of data we extract
    href = scrapy.Field()
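A scrapy.Item behaves like a dict restricted to its declared fields; a quick illustration (hypothetical values, not part of the project):

item = JreduItem(title=u'some headline', href=u'http://www.sohu.com/a/123')
print(item['title'])   # u'some headline'
item['author'] = 'x'   # raises KeyError: 'author' was never declared as a Field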

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals

class JreduSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
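This file is the untouched template that scrapy startproject generates; since the middleware is never enabled under SPIDER_MIDDLEWARES in settings.py below, it has no effect on this crawl and can be left as-is.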

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json

class JreduPipeline(object):
    def __init__(self):
        # codecs.open handles the utf-8 encoding for us; mode "w", since we
        # write unicode strings rather than raw bytes
        self.fill = codecs.open("data.txt", encoding="utf-8", mode="w")

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese readable
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.fill.write(line)
        return item

    def close_spider(self, spider):
        # flush and release the file handle when the crawl finishes
        self.fill.close()
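Each item is written as one JSON object per line (the JSON Lines format), so the output file is easy to process afterwards; for example, a small reader sketch:

import codecs
import json

with codecs.open("data.txt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print("%s -> %s" % (record['title'], record['href']))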

settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for jredu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   http://doc.scrapy.org/en/latest/topics/settings.html
#   http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#   http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jredu'
SPIDER_MODULES = ['jredu.spiders']
NEWSPIDER_MODULE = 'jredu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jredu (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
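# NOTE: with ROBOTSTXT_OBEY = True, Scrapy fetches the site's robots.txt first
# and drops any request it disallows (logged as "Forbidden by robots.txt");
# if the spider unexpectedly yields no items, check this setting first.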
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'jredu.middlewares.JreduSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'jredu.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'jredu.pipelines.JreduPipeline': 300,
}
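# (300 is the pipeline's priority: pipelines run in ascending order of this
# value, which conventionally ranges from 0 to 1000)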
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Finally, we need an entry point to run the spider:

main.py:

#!/usr/bin/python
# -*- coding:utf-8 -*-
# entry point for running the spider
from scrapy import cmdline

cmdline.execute("scrapy crawl tt".split())
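Running main.py is simply a shortcut for typing scrapy crawl tt inside the project directory. As an alternative sketch, the crawl can also be started in-process through Scrapy's CrawlerProcess API:

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl('tt')  # the spider name declared in JreduSpider
process.start()      # blocks until the crawl finishes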

We hope this article helps everyone with their Python programming.
