Scrapy spider crawling patterns, with example code

Posted in Python on January 25, 2018

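This lesson introduces the Scrapy crawling framework, focusing on one of its components, the spider.

A spider can crawl in several ways:

Crawl a single page
Build links from a given list and crawl multiple pages
Find the "next page" link and follow it
Follow the links found on a page and crawl the pages they point to

An example of each follows.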

1. Crawl a single page

# by 寒小阳 (hanxiaoyang.ml@gmail.com)

import scrapy


class JulyeduSpider(scrapy.Spider):
    name = "julyedu"
    start_urls = [
        'https://www.julyedu.com/category/index',
    ]

    def parse(self, response):
        # Each course card on the listing page sits in its own div.
        for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
            item = {
                'title': julyedu_class.xpath('a/h4/text()').extract_first(),
                'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),
                'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),
                # Resolve the (possibly relative) image path against the page URL.
                'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()),
            }
            # Echo the scraped fields to the console, then hand the item to Scrapy.
            for key in ('title', 'desc', 'time', 'img_url'):
                print(item[key])
            print("\n")

            yield item
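
The img_url field depends on response.urljoin, which turns a possibly relative src value into an absolute URL. Scrapy documents this method as a wrapper over the standard library's urllib.parse.urljoin, so the resolution rules can be tried without running a spider. A minimal sketch follows; the image paths are made up for illustration and are not taken from the julyedu page.

```python
from urllib.parse import urljoin

# The page the spider starts from; it serves as the base for relative links.
page_url = 'https://www.julyedu.com/category/index'

# Hypothetical src values, chosen only to show the three common cases.
candidates = [
    '/images/course.png',                  # root-relative path
    'thumbs/course.png',                   # relative to the page's directory
    'https://cdn.example.com/course.png',  # already absolute
]

resolved = [urljoin(page_url, src) for src in candidates]
for src, url in zip(candidates, resolved):
    print(src, '->', url)
```

An already absolute URL passes through unchanged, which is why the spider can call urljoin on every src without checking its form first.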

2. Build links from a given list and crawl multiple pages

# by 寒小阳 (hanxiaoyang.ml@gmail.com)

import scrapy


class CnBlogSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    # One URL per page number, 1 through 10. Everything after '#' is a URL
    # fragment and is not sent to the server, so verify that the site really
    # returns a different page for each of these URLs.
    start_urls = [
        'http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)
    ]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            body = article.xpath('div[@class="post_item_body"]')
            foot = body.xpath('div[@class="post_item_foot"]')
            item = {
                # default='' keeps .strip() from failing when a node is missing.
                'title': body.xpath('h3/a/text()').extract_first(default='').strip(),
                'link': response.urljoin(body.xpath('h3/a/@href').extract_first()).strip(),
                'summary': body.xpath('p/text()').extract_first(default='').strip(),
                'author': foot.xpath('a/text()').extract_first(default='').strip(),
                'author_link': response.urljoin(body.xpath('div/a/@href').extract_first()).strip(),
                'comment': foot.xpath('span[@class="article_comment"]/a/text()').extract_first(default='').strip(),
                'view': foot.xpath('span[@class="article_view"]/a/text()').extract_first(default='').strip(),
            }
            # Echo the scraped fields to the console, then hand the item to Scrapy.
            for key in ('title', 'link', 'summary', 'author', 'author_link', 'comment', 'view'):
                print(item[key])
            print("")

            yield item
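
The ten entries in start_urls above differ only in the part after '#'. That part is a URL fragment, which is handled on the client side and is not sent to the server, so it is worth checking what actually varies between the ten requests. A small sketch using only the standard library:

```python
from urllib.parse import urldefrag

# Same construction as the spider's start_urls: page numbers 1 through 10.
start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

# urldefrag separates the part of a URL that is requested from its fragment.
requested = {urldefrag(url).url for url in start_urls}
fragments = [urldefrag(url).fragment for url in start_urls]

print(len(start_urls), 'start URLs')
print(len(requested), 'distinct URL(s) once the fragment is removed:', sorted(requested))
print('fragments:', fragments)
```

If a site pages through a query string or a path segment instead, the same list comprehension works with that pattern. Which pattern cnblogs.com actually uses is not something this sketch can confirm.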

3. Find the "next page" link and follow it

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }

        # The "next" <li> wraps an <a>; the link is that anchor's href attribute.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            # Make the relative href absolute and parse the next page the same way.
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
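
The control flow of this spider (parse a page, look for a next link, resolve it, repeat until there is none) can be traced without any network access. The sketch below simulates it over an in-memory dictionary. The page contents and the FAKE_SITE name are invented for illustration, and a plain loop stands in for Scrapy's scheduler and the scrapy.Request callback.

```python
from urllib.parse import urljoin

# Hypothetical site: each URL maps to (items on the page, href of the next link or None).
FAKE_SITE = {
    'http://example.com/tag/humor/': (['quote 1', 'quote 2'], '/tag/humor/page/2/'),
    'http://example.com/tag/humor/page/2/': (['quote 3'], '/tag/humor/page/3/'),
    'http://example.com/tag/humor/page/3/': (['quote 4'], None),
}


def crawl(start_url):
    """Yield every item, following next links until a page has none."""
    url = start_url
    while url is not None:
        items, next_href = FAKE_SITE[url]
        for item in items:
            yield item
        # Mirror the spider: stop on a missing link, otherwise make it absolute.
        url = urljoin(url, next_href) if next_href is not None else None


collected = list(crawl('http://example.com/tag/humor/'))
print(collected)
```

The last page has no next link, so the loop ends there, just as the spider stops yielding requests when extract_first() returns None.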

4. Follow links into pages and crawl them

# by 寒小阳 (hanxiaoyang.ml@gmail.com)

import scrapy


class QQNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        # The index page only supplies links; each article gets its own request.
        for href in response.xpath('//*[@id="news"]/div/div/div/div/em/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Called once per article page requested from parse().
        item = {
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
            'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
            'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
        }
        # Echo the scraped fields to the console, then hand the item to Scrapy.
        for key in ('title', 'time', 'cate', 'content'):
            print(item[key])
        print("")
        yield item
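
The content field joins the extracted text nodes exactly as they come out of the page, so indentation and empty nodes end up in the result. One possible cleanup step, sketched here with a hypothetical helper named join_paragraphs and made-up sample strings, is to strip each paragraph and drop the empty ones before joining.

```python
def join_paragraphs(paragraphs):
    """Strip each text node, drop the empty ones, and join the rest with newlines."""
    cleaned = (p.strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)


# Sample input shaped like the list of strings that .extract() returns.
raw = ['\u3000\u3000First paragraph. ', '', '\n  ', 'Second paragraph.']
content = join_paragraphs(raw)
print(content)
```

In the spider, this helper would wrap the list returned by .extract() in place of the bare "\n".join call.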

- Author -

NodYoung
