Scraping Taobao Product Information with a Python Crawler (Selenium + PhantomJS)

Posted in Python on February 24, 2018

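This article shares a complete example of a Python crawler that scrapes product listings from Taobao, for your reference. The details are as follows.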

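1. Goal:

Open the Taobao search page, search for the keyword 耐克 (Nike), and scrape each product's title, link, price, city, seller's Wangwang ID, and number of buyers. Then follow each link to the second level (the product detail page) and scrape the sales volume, model number, and so on.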
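
The search step comes down to building a URL. The source code below stores the keyword already percent-encoded (%E8%80%90%E5%85%8B) and advances the s parameter by 44 per page. The following is a minimal sketch of how such a URL could be generated from the plain keyword. The helper name build_search_url and the reduced parameter set (only q, ie and s) are assumptions made for illustration; the script itself sends several more query parameters, and Taobao may require them.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

SEARCH_URL = "https://s.taobao.com/search"
ITEMS_PER_PAGE = 44  # offset step taken from the script's "s=" parameter


def build_search_url(keyword, page):
    """Return a search URL for a zero-based result page (hypothetical helper)."""
    # Percent-encode the UTF-8 bytes of the keyword, e.g. 耐克 -> %E8%80%90%E5%85%8B
    encoded = quote(keyword.encode("utf-8"))
    return "{0}?q={1}&ie=utf8&s={2}".format(SEARCH_URL, encoded, page * ITEMS_PER_PAGE)


print(build_search_url("耐克", 0))
print(build_search_url("耐克", 2))
```

Generating the encoded form from the readable keyword makes it easier to switch to another search term without encoding it by hand.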

2. Results

[Screenshot of the scraped results]

3. Source code

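# encoding: utf-8
# Written for Python 2 (print statements, reload(sys)).
import re
import sys
import time

import pandas as pd
from lxml import etree
from selenium import webdriver

# Python 2 only: make UTF-8 the default codec so Chinese text converts without errors
reload(sys)
sys.setdefaultencoding('utf-8')

time1=time.time()  # start time, used to report the total run time

# Drive a headless browser; change the path to your own phantomjs executable
driver=webdriver.PhantomJS(executable_path='D:/Python27/Scripts/phantomjs.exe')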

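# Lists that store the scraped fields
title=[]      # product title
price=[]      # price
city=[]       # seller's city
shop_name=[]  # seller's Wangwang ID
num=[]        # number of buyers
link=[]       # product link
sale=[]       # sales volume (from the detail page)
number=[]     # model number (from the detail page)

# Search keyword 耐克 (Nike); it must be in percent-encoded UTF-8 form to be used in a URL
keyword="%E8%80%90%E5%85%8B"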


for i in range(0,1):  # zero-based result pages to scrape

  try:
    print "............... Scraping page "+str(i)+" ..........................."

    # The "s" parameter is the item offset; it advances by 44 per page
    url="https://s.taobao.com/search?q="+keyword+"&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20170710&ie=utf8&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s="+str(i*44)
    driver.get(url)
    time.sleep(5)  # give the JavaScript-rendered results time to load
    html=driver.page_source

    # Parse the rendered search page and collect the list-level fields
    selector=etree.HTML(html)

    # Product titles
    title1=selector.xpath('//div[@class="row row-2 title"]/a')
    for each in title1:
      print each.xpath('string(.)').strip()
      title.append(each.xpath('string(.)').strip())

    # Prices
    price1=selector.xpath('//div[@class="price g_price g_price-highlight"]/strong/text()')
    for each in price1:
      print each
      price.append(each)

    # Seller cities
    city1=selector.xpath('//div[@class="location"]/text()')
    for each in city1:
      print each
      city.append(each)

    # Number of buyers
    num1=selector.xpath('//div[@class="deal-cnt"]/text()')
    for each in num1:
      print each
      num.append(each)

    # Seller Wangwang IDs
    shop_name1=selector.xpath('//div[@class="shop"]/a/span[2]/text()')
    for each in shop_name1:
      print each
      shop_name.append(each)


    # Second level: open each product page to collect sales volume and model number
    link1=selector.xpath('//div[@class="row row-2 title"]/a/@href')
    for href in link1:
      # Links may be absolute, protocol-relative ("//...") or lack a scheme entirely
      if href.startswith("//"):
        kk="https:" + href
      elif href.startswith("http"):
        kk=href
      else:
        kk="https://" + href
      print kk
      link.append(kk)

      driver.get(kk)
      time.sleep(3)
      html2=driver.page_source
      selector2=etree.HTML(html2)

      # Sales volume: the two XPaths cover two page layouts; store exactly one value per product
      sale1=selector2.xpath('//*[@id="J_DetailMeta"]/div[1]/div[1]/div/ul/li[1]/div/span[2]/text()')
      sale2=selector2.xpath('//strong[@id="J_SellCounter"]/text()')
      found_sale=(sale1 + sale2)[:1] or ["NULL"]
      print found_sale[0]
      sale.append(found_sale[0])

      # Model number: pick the attribute list that matches the site
      if "click" in kk:
        attr_lists=[]  # ad redirect links: no attributes are parsed
      elif "tmall" in kk:
        attr_lists=re.findall(u'<ul id="J_AttrUL">(.*?)</ul>', html2, re.S)
      elif "taobao" in kk:
        attr_lists=re.findall(u'<ul class="attributes-list">(.*?)</ul>', html2, re.S)
      else:
        attr_lists=[]

      # Collect every attribute whose label ends in 号 (for example 款号 or 货号)
      m=[]
      for attrs in attr_lists:
        m.extend(re.findall(u'号: (.*?)</li>', attrs, re.S))
      for each1 in m:
        print each1
      # Store exactly one entry per product so that all lists stay the same length
      number.append(u", ".join(v.strip() for v in m) if m else "NULL")

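  except Exception as e:
    # Report the failure instead of silently discarding it
    print "Page " + str(i) + " failed: " + str(e)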


# All lists must have the same length, otherwise the DataFrame below cannot be built
print len(title),len(city),len(price),len(num),len(shop_name),len(link),len(sale),len(number)

# Build the DataFrame
data1=pd.DataFrame({"Title":title,"Price":price,"Wangwang":shop_name,"City":city,"Buyers":num,"Link":link,"Sales":sale,"Model number":number})
print data1
# Write the results to Excel
writer = pd.ExcelWriter(r'C:\taobao_spider2.xlsx', engine='xlsxwriter', options={'strings_to_urls': False})
data1.to_excel(writer, index=False)
writer.close()

time2 = time.time()
print 'OK, crawl finished!'
print 'Total time: ' + str(time2 - time1) + 's'
# Close the browser and end the PhantomJS process
driver.quit()
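
The model-number step can be tried without a browser. The following is a minimal sketch, assuming a simplified, hypothetical HTML fragment shaped like the Tmall attribute list (the J_AttrUL id comes from the script above; the sample markup and the helper name extract_model_numbers are made up for illustration). Real product pages may format the label and separator differently, so the pattern would need to be checked against the actual page source.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re

# Attribute list container used on Tmall detail pages, as in the script above
ATTR_LIST_RE = re.compile(r'<ul id="J_AttrUL">(.*?)</ul>', re.S)
# Any attribute whose label ends in 号, followed by a colon and optional whitespace
MODEL_RE = re.compile(r'号:\s*(.*?)</li>', re.S)


def extract_model_numbers(html):
    """Return all matching attribute values, or ["NULL"] when none are found."""
    found = []
    for block in ATTR_LIST_RE.findall(html):
        found.extend(value.strip() for value in MODEL_RE.findall(block))
    return found or ["NULL"]


# Hypothetical sample markup, not copied from a real page
sample = '<ul id="J_AttrUL"><li title="A1">款号: A1</li><li>品牌: Nike</li></ul>'
print(extract_model_numbers(sample))
print(extract_model_numbers("<div>no attribute list</div>"))
```

Returning a placeholder when nothing matches means every product contributes a value, which keeps the column the same length as the others.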

That is all for this article. We hope it helps with your studies, and we hope you will continue to support 三水点靠木.

- Author -

开心果汁
