
An Example of Scraping Web Content with Scrapy in Python

Posted in Python on May 21, 2018

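Last week I spent a week learning Python and Scrapy and built a complete web crawler from scratch. The research was painful but enjoyable, which is what working in tech is like.

First, install Python. There are plenty of pitfalls, and you have to climb out of them one at a time. I work on Windows (no money for a Mac) and ran into all sorts of problems during installation, most of them about dependencies.

I won't repeat the installation tutorial here. If the installation fails with an error saying that Windows C/C++ is required, the usual cause is a missing Windows build environment. Most tutorials online tell you to install Visual Studio, which is overkill; in fact, installing the Windows SDK is enough.

Here is my crawler code.

The spider: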

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from zjf.FsmzItems import FsmzItem


# Circle (group): emotional life
class MySpider(scrapy.Spider):
    # Spider name
    name = "MySpider"
    # Allowed domains
    allowed_domains = ["nvsheng.com"]
    # Start URLs; filled in from the command line in __init__
    start_urls = []

    # init: accept the start URL dynamically
    # Command-line usage: scrapy crawl MySpider -a start_url="http://some_url"
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]
        # Flag: True until the first page has been parsed
        self.first_page = True

    # Parse callback
    def parse(self, response):
        item = FsmzItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['text'] = response.xpath('//*[@class="content"]/p/text()').extract()
        item['imags'] = response.xpath(
            '//div[@id="content"]/p/a/img/@src|//div[@id="content"]/p/img/@src'
        ).extract()
        # Only the first page schedules the remaining pages of the article
        if self.first_page:
            self.first_page = False
            for page_single in self.getUrl(response):
                yield Request(page_single)
        yield item

    # Collect the pagination links, excluding the "next" link
    def getUrl(self, response):
        url_list = []
        page_list_tmp = response.xpath(
            '//div[@class="viewnewpages"]/a[not(@class="next")]/@href'
        ).extract()
        for page_tmp in page_list_tmp:
            # Compare the absolute URL, so that duplicates are actually skipped
            page_url = "http://www.nvsheng.com/emotion/px/" + page_tmp
            if page_url not in url_list:
                url_list.append(page_url)
        return url_list
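
The pagination step in getUrl turns each relative href into an absolute URL under a fixed prefix. A minimal standalone sketch of that step follows, using urljoin from the standard library and de-duplicating on the absolute URL. The function name build_page_urls and the sample hrefs are illustrative assumptions; only the prefix comes from the spider.

```python
from urllib.parse import urljoin

# Prefix taken from the spider; everything else here is illustrative.
BASE_URL = "http://www.nvsheng.com/emotion/px/"


def build_page_urls(base_url, hrefs):
    """Return absolute page URLs in first-seen order, without duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    # Made-up hrefs; the real ones come from the pagination links on the page.
    sample_hrefs = ["123_2.html", "123_3.html", "123_2.html"]
    for url in build_page_urls(BASE_URL, sample_hrefs):
        print(url)
```

Scrapy's own duplicate filter would also drop repeated requests, so the explicit check mainly keeps the returned list tidy.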

The pipeline class:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import os
import random
import urllib.request

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

from zjf import settings


class MyPipeline(object):
    flag = 1
    post_title = ''
    post_text = []
    post_text_imageUrl_list = []
    cs = []
    user_id = ''

    def __init__(self):
        # Pick a random user_id to post as
        MyPipeline.user_id = MyPipeline.getRandomUser('37619,18441390,18441391')

    # Process the data
    def process_item(self, item, spider):
        # Body text: join the extracted paragraphs
        text_str_tmp = "".join(item['text'])
        # Title: taken from the first item only
        if MyPipeline.flag == 1:
            title = item['title']
            MyPipeline.post_title = MyPipeline.post_title + title[0]
        # Download the images, then upload them
        text_insert_pic = ''
        text_insert_pic_w = ''
        text_insert_pic_h = ''
        for imag_url in item['imags']:
            img_name = imag_url.replace('/', '').replace('.', '').replace('|', '').replace(':', '')
            pic_dir = os.path.join(settings.IMAGES_STORE, '%s.jpg' % img_name)
            urllib.request.urlretrieve(imag_url, pic_dir)
            # Upload the image; the response is JSON
            upload_img_result = MyPipeline.uploadImage(pic_dir, 'image/jpeg')
            # Read the stored image path and size from the JSON
            text_insert_pic = upload_img_result['result']['image_url']
            text_insert_pic_w = upload_img_result['result']['w']
            text_insert_pic_h = upload_img_result['result']['h']
        # Build the JSON entry for this page
        if MyPipeline.flag == 1:
            cs_json = {"c": text_str_tmp, "i": "", "w": text_insert_pic_w, "h": text_insert_pic_h}
        else:
            cs_json = {"c": text_str_tmp, "i": text_insert_pic, "w": text_insert_pic_w, "h": text_insert_pic_h}
        MyPipeline.cs.append(cs_json)
        MyPipeline.flag += 1
        return item

    # Called when the spider opens
    def open_spider(self, spider):
        pass

    # Called when the spider closes: submit everything as one post
    def close_spider(self, spider):
        strcs = json.dumps(MyPipeline.cs)
        jsonData = {"apisign": "99ea3eda4b45549162c4a741d58baa60", "user_id": MyPipeline.user_id,
                    "gid": 30, "t": MyPipeline.post_title, "cs": strcs}
        MyPipeline.uploadPost(jsonData)

    # Upload an image and return the parsed JSON response
    @staticmethod
    def uploadImage(img_path, content_type):
        "uploadImage functions"
        # UPLOAD_IMG_URL = "http://api.qa.douguo.net/robot/uploadpostimage"
        UPLOAD_IMG_URL = "http://api.douguo.net/robot/uploadpostimage"
        # Keep the file open until the request has been sent
        with open(img_path, 'rb') as img_file:
            m = MultipartEncoder(
                fields={'user_id': MyPipeline.user_id,
                        'apisign': '99ea3eda4b45549162c4a741d58baa60',
                        'image': ('filename', img_file, content_type)}
            )
            r = requests.post(UPLOAD_IMG_URL, data=m, headers={'Content-Type': m.content_type})
        return r.json()

    # Create the post
    @staticmethod
    def uploadPost(jsonData):
        CREATE_POST_URL = "http://api.douguo.net/robot/uploadimagespost"
        reqPost = requests.post(CREATE_POST_URL, data=jsonData)
        return reqPost

    # Pick one user ID at random from a comma-separated string
    @staticmethod
    def getRandomUser(userStr):
        user_list = str(userStr).split(',')
        return random.choice(user_list)
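
As the comment at the top of the pipeline file says, the pipeline only runs once it is registered in ITEM_PIPELINES, and the pipeline also reads settings.IMAGES_STORE. A minimal sketch of the two settings follows. The module path zjf.pipelines.MyPipeline and the directory are assumptions, since the original post does not show its settings file.

```python
# settings.py (sketch): the module path and directory below are assumptions

# Register the pipeline. The integer sets the run order among pipelines
# (conventionally 0-1000, lower values run first).
ITEM_PIPELINES = {
    "zjf.pipelines.MyPipeline": 300,  # assumed location of MyPipeline
}

# Directory where downloaded images are saved before upload.
# The trailing separator keeps plain string concatenation working too.
IMAGES_STORE = "D:/pics/"  # hypothetical path; the directory must already exist
```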

The Items class that holds the fields:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FsmzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    # tutor = scrapy.Field()
    # strongText = scrapy.Field()
    text = scrapy.Field()
    imags = scrapy.Field()

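Type the following on the command line:

scrapy crawl MySpider -a start_url=http://www.aaa.com

This crawls the content at that address, where www.aaa.com stands in for the page you want to crawl. The URL must include the scheme (http:// or https://); Scrapy rejects a request URL without one.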
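
Because the start URL arrives as a raw command-line string, it can help to validate it before handing it to Scrapy. The sketch below is one possible guard, using only the standard library. The helper name normalize_start_url and the choice of http as the default scheme are assumptions, not part of the original spider.

```python
from urllib.parse import urlparse


def normalize_start_url(raw_url, default_scheme="http"):
    """Return raw_url with a scheme, adding default_scheme when it is missing."""
    if raw_url is None or not raw_url.strip():
        raise ValueError("start_url is required, e.g. -a start_url=http://www.aaa.com")
    raw_url = raw_url.strip()
    # Only http and https count as "already has a scheme" here.
    if urlparse(raw_url).scheme in ("http", "https"):
        return raw_url
    return f"{default_scheme}://{raw_url}"


if __name__ == "__main__":
    print(normalize_start_url("www.aaa.com"))
    print(normalize_start_url("https://www.aaa.com/page.html"))
```

In the spider, such a helper could wrap kwargs.get('start_url') inside __init__ before the value is stored in start_urls.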


- Author -

止鱼
