Python crawler example: looking up the home location for the first 7 digits of a mobile number

Posted in Python on March 31, 2020

Requirements analysis

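The project needs the first 7 digits of a mobile number, both to check whether the number is valid and to look up its home location. The existing data is several years old, so I decided to crawl a fresh copy with a Python crawler.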
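
As a rough illustration of how the crawled data might be used afterwards, the sketch below loads lines in the number,province,mobile_area,postcode layout that the scripts in this article write to result.txt and looks a number up by its first 7 digits. It assumes the first column starts with the 7-digit prefix; the helper names load_prefix_table and lookup and the sample rows are hypothetical.

```python
import io


def load_prefix_table(lines):
    """Build {7-digit prefix: (province, mobile_area, postcode)} from CSV-like lines."""
    table = {}
    for raw in lines:
        parts = raw.strip().split(",")
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        number, province, mobile_area, postcode = parts
        table[number[:7]] = (province, mobile_area, postcode)
    return table


def lookup(table, phone):
    """Return the record for an 11-digit number, or None if it is malformed or unknown."""
    if len(phone) != 11 or not phone.isdigit():
        return None
    return table.get(phone[:7])


# Hypothetical sample rows standing in for the contents of result.txt
sample = io.StringIO("1380000,ProvinceA,CarrierA,0001\n1390001,ProvinceB,CarrierB,0002\n")
table = load_prefix_table(sample)
print(lookup(table, "13800001234"))
print(lookup(table, "12345"))
```

A number whose prefix is missing from the table comes back as None, which can serve as a simple validity check.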

Single-threaded version

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # The response is a short multi-line block; the wanted values sit
    # between single quotes on fixed lines.
    text = textData.splitlines(True)

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # "with" closes the file even if the write fails
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
    # "last" resumes from the prefix reached before an abnormal exit
    last = 0
    # last = 4
    # Generate the phone numbers automatically, padding the last four digits with 0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Full crawl: for i in range(last, 10000):
      # Short test run: only the first 10 prefixes of each segment
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      # Only the first segment resumes part-way; later ones start from 0
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  # The original list contained '166' twice; the duplicate is removed
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

Crawling one segment takes 10,000 queries, and the single-threaded version needs roughly an hour and a half for that, which is too slow.
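
The phoneInfoHandler method above relies on each value sitting on a fixed line of the response. A less position-dependent alternative, sketched here under assumptions, is to pull out every key:'value' pair with a regular expression. The sample body and its field names are hypothetical and only mirror the variable names used in the script; the real response may use different keys.

```python
import re

# Matches pairs such as  province:'...'  anywhere in the response text
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text):
    """Return a dict of every key:'value' pair found in the text."""
    return dict(FIELD_RE.findall(text))


# Hypothetical response body; the real field names may differ
sample = """__Result__ = {
    number:'1380000',
    province:'ProvinceA',
    mobile_area:'CarrierA',
    postcode:'0001'
}"""

fields = parse_fields(sample)
print(fields)
```

Looking fields up by name means an extra or reordered line in the response does not silently shift the columns written to result.txt.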

Multi-threaded version

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32
# Serializes appends to result.txt across worker threads
write_lock = threading.Lock()


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    with lock:
      if q.qsize() == 0:
        break
      print("queue size:" + str(q.qsize()))
      p = q.get() # take one task

    # Build the number from the dequeued value. Deriving it from q.qsize()
    # after the lock is released lets two threads end up with the same number.
    middle = str(p).zfill(4)
    phoneNum = phone_head + middle + "0000"
    print("phoneNum:" + phoneNum)

    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking a worker forever
      response = requests.get(url, timeout=10)
      phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # One writer at a time; "with" also closes the file after each write
      with write_lock, open('./result.txt', 'a') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  # The original list contained '166' twice; the duplicate is removed
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    # One task per 4-digit middle part, 0000 to 9999
    for p in range(10000):
      q.put(p)

    print(q.qsize())

    # Workers read the current segment from this module-level name
    phone_head = head
    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

With the multi-threaded version, the 10,000 queries for one segment finish in about 2 to 3 minutes. CPU usage spikes and then holds at around 70%.

There are a little over 40 segments in total; crawling all of them takes about 1 to 2 hours and yields roughly 410,000 records.
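
The same fan-out can also be written with the standard library's thread pool instead of a hand-managed queue and lock. This is a minimal sketch, not the article's implementation: build_numbers, fetch_stub and crawl_segment are hypothetical names, and fetch_stub stands in for the real HTTP request and parsing so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head, count=10000):
    """Yield head + 4-digit middle + '0000' for middle = 0000 .. count-1."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def fetch_stub(phone_num):
    """Placeholder for the real request and parsing step (hypothetical)."""
    return phone_num[:7] + ",stub"


def crawl_segment(head, fetch=fetch_stub, workers=32, count=10000):
    """Run fetch over one segment in a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each number to exactly one worker and preserves input order
        return list(pool.map(fetch, build_numbers(head, count)))


results = crawl_segment("138", count=5)
print(results)
```

Because the results come back to the calling thread, they can be written to result.txt in one place, so no lock around the file is needed.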

- Author -

wanli001
