Python crawler example: looking up the home region for the first 7 digits of a mobile number


Posted in Python on March 31, 2020

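Requirements

Our project uses the first 7 digits of a mobile number to check whether the number is valid and to look up its home region. The existing data is several years old, so I decided to crawl a fresh copy with a Python crawler.

Single-threaded version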

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # Split the response into lines; the fields are picked by line position
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      # Take the text between the first pair of single quotes on each line
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        # Append one comma-separated row; the with block closes the file
        with open('./result.txt', 'a', encoding='utf-8') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

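  def requestAllSections(self):
    # last is used to resume from the number reached before an abnormal exit
    last = 0
    # last = 4
    # Generate the phone numbers automatically, padding the last four digits with 0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Use a small upper bound such as 10 for a quick test run
      for i in range(last, 10000):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      # Only the first prefix resumes; every later prefix starts from 0
      last = 0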

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  # The original list contained '166' twice; the duplicate is removed
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Prefixes to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

Crawling one prefix takes 10,000 queries, and the single-threaded version needs roughly an hour and a half for it, which is too slow.
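The handler in the listing above picks fields by line position and by splitting on single quotes, which breaks as soon as the response layout shifts by one line. A minimal sketch of a more tolerant parser follows. It assumes the response is a JavaScript-style object with `key:'value'` pairs; the sample text and its field names (`mts`, `province`, `catName`, `telString`) are illustrative assumptions, not a guaranteed description of the endpoint. `parse_fields` is a hypothetical helper name.

```python
import re

# Assumed sample of a response body; the real field set may differ.
SAMPLE = """__GetZoneResult_ = {
    mts:'1330000',
    province:'ExampleProvince',
    catName:'ExampleCarrier',
    telString:'13300000000'
}"""

# Matches key:'value' pairs regardless of which line they appear on
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text):
    """Return a dict of every key:'value' pair found in text."""
    return dict(FIELD_RE.findall(text))


fields = parse_fields(SAMPLE)
print(fields.get("mts"), fields.get("province"))
```

Looking fields up by name with `fields.get(...)` returns `None` for a missing key instead of raising an `IndexError`, so an unexpected response can be skipped or logged rather than silently mis-parsed.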

Multithreaded version

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # take one task (1..10000)
      lock.release()

      # Derive the middle four digits from the task itself. Reading q.qsize()
      # after releasing the lock races with other threads and can repeat or
      # skip numbers.
      middle = str(p - 1).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        # A timeout keeps one stalled request from blocking a worker forever
        response = requests.get(url, timeout=10)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(type(e).__name__, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  # Split the response into lines; the fields are picked by line position
  text = textData.splitlines(True)

  if len(text) >= 9:
    # Take the text between the first pair of single quotes on each line
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      # Append one comma-separated row; the with block closes the file
      with open('./result.txt', 'a', encoding='utf-8') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    # All worker threads read the current prefix from this module-level variable
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

With the multithreaded version, one prefix (10,000 queries) finishes in about 2 to 3 minutes. CPU usage spikes and then stays at around 70%.
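The listing above manages the queue, the lock, and the threads by hand. As an alternative (not the approach used in this article), the standard library's `concurrent.futures.ThreadPoolExecutor` can distribute the same 10,000 probe numbers per prefix. The sketch below assumes the caller supplies a `fetch` function that wraps the HTTP request and parsing; `probe_numbers`, `crawl_prefix`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_numbers(prefix):
    """Yield one 11-digit probe number per 7-digit segment under a prefix."""
    for i in range(10000):
        # 3-digit prefix + 4-digit middle = 7-digit segment, padded with 0000
        yield prefix + str(i).zfill(4) + "0000"


def crawl_prefix(prefix, fetch, workers=32):
    """Call fetch(number) for every probe number and keep non-empty results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so rows stay sorted by number
        results = pool.map(fetch, probe_numbers(prefix))
        return [row for row in results if row]


def fake_fetch(number):
    # Stand-in for the real request, returning a made-up row
    return number[:7] + ",ExampleProvince"


rows = crawl_prefix("133", fake_fetch, workers=8)
print(len(rows), rows[0], rows[-1])
```

Because each task carries its own number, no shared counter is needed. Note that an exception raised inside `fetch` surfaces when the results are consumed, so a real `fetch` would likely catch request errors itself and return `None`.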

There are a little over 40 prefixes in total. Crawling all of them takes roughly 1 to 2 hours and yields about 410,000 records.
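To close the loop with the original requirement, here is a minimal sketch of how the crawled rows might be used to validate a number and look up its home region. It assumes each row in result.txt has the four comma-separated columns the crawler writes, and that the first column is the 7-digit segment. `load_segments` and `lookup` are hypothetical helper names, and the sample rows are made up.

```python
import io


def load_segments(lines):
    """Build {7-digit segment: (province, mobile_area, postcode)} from rows."""
    table = {}
    for line in lines:
        parts = line.strip().split(",")
        # Skip rows that do not have four columns or a 7-digit key
        if len(parts) == 4 and len(parts[0]) == 7 and parts[0].isdigit():
            table[parts[0]] = tuple(parts[1:])
    return table


def lookup(table, number):
    """Return the record for an 11-digit number, or None if unknown or malformed."""
    if len(number) != 11 or not number.isdigit():
        return None
    return table.get(number[:7])


# Made-up rows in the same column order the crawler writes
sample = io.StringIO(
    "1330000,ProvinceA,CarrierA,000001\n"
    "1330001,ProvinceB,CarrierA,000002\n"
)
table = load_segments(sample)
print(lookup(table, "13300011234"))
print(lookup(table, "19900001234"))
```

With a real file, `load_segments(open('./result.txt', encoding='utf-8'))` would take the place of the in-memory sample. A `None` result means the number is malformed or its segment was not found in the crawled data.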

