python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python 检查数组元素是否存在类似PHP isset()方法
Oct 14 Python
Flask框架的学习指南之制作简单blog系统
Nov 20 Python
Python安装Numpy和matplotlib的方法(推荐)
Nov 02 Python
简单的python协同过滤程序实例代码
Jan 31 Python
Python3 获取一大段文本之间两个关键字之间的内容方法
Oct 11 Python
[原创]Python入门教程2. 字符串基本操作【运算、格式化输出、常用函数】
Oct 29 Python
浅谈Python中的可迭代对象、迭代器、For循环工作机制、生成器
Mar 11 Python
python 多维高斯分布数据生成方式
Dec 09 Python
解决django无法访问本地static文件(js,css,img)网页里js,cs都加载不了
Apr 07 Python
pycharm 关掉syntax检查操作
Jun 09 Python
Python selenium键盘鼠标事件实现过程详解
Jul 28 Python
Python2及Python3如何实现兼容切换
Sep 01 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
apache rewrite_module模块使用教程
2008/01/10 PHP
php防注入,表单提交值转义的实现详解
2013/06/10 PHP
php与flash as3 socket通信传送文件实现代码
2014/08/16 PHP
PHP记录搜索引擎蜘蛛访问网站足迹的方法
2015/04/15 PHP
PHP实现在对象之外访问其私有属性private及保护属性protected的方法
2017/11/20 PHP
Laravel使用消息队列需要注意的一些问题
2017/12/13 PHP
JS支持带x身份证号码验证函数
2008/08/10 Javascript
firefox事件处理之自动查找event的函数(用于onclick=foo())
2010/08/05 Javascript
再论Javascript的类继承
2011/03/05 Javascript
JQUERY 实现窗口滚动搜索框停靠效果(类似滚动停靠)
2013/03/27 Javascript
一个网页标题title的闪动提示效果实现思路
2014/03/22 Javascript
jquery操作复选框checkbox的方法汇总
2015/02/05 Javascript
详解JavaScript中常用的函数类型
2015/11/18 Javascript
AngularJS中的Directive实现延迟加载
2016/01/25 Javascript
jquery实现全选和全不选功能效果的实现代码【推荐】
2016/05/05 Javascript
JS关闭窗口时产生的事件及用法示例
2016/08/20 Javascript
form+iframe解决跨域上传文件的方法
2016/11/18 Javascript
Bootstrap源码解读表单(2)
2016/12/22 Javascript
微信小程序教程系列之设置标题栏和导航栏(7)
2020/06/29 Javascript
关于在vue-cli中使用微信自动登录和分享的实例
2017/06/22 Javascript
vue插件实现v-model功能
2018/09/10 Javascript
Vue实现base64编码图片间的切换功能
2019/12/04 Javascript
vue实现输入框自动跳转功能
2020/05/20 Javascript
[15:15]教你分分钟做大人:狙击手
2014/10/30 DOTA
python解析json串与正则匹配对比方法
2018/12/20 Python
Python+OpenCV+图片旋转并用原底色填充新四角的例子
2019/12/12 Python
Python loguru日志库之高效输出控制台日志和日志记录
2020/03/07 Python
Python爬虫破解登陆哔哩哔哩的方法
2020/11/17 Python
DJI美国:消费类无人机领域的领导者
2018/04/27 全球购物
FILA德国官方网站:来自意大利的体育和街头服饰品牌
2019/07/19 全球购物
传播学毕业生求职信
2013/10/11 职场文书
实用的简历自我评价
2014/03/06 职场文书
yy婚礼主持词
2014/03/14 职场文书
毕业晚宴祝酒词
2015/08/11 职场文书
《敬重卑微》读后感3篇
2019/11/26 职场文书
Node-Red实现MySQL数据库连接的方法
2021/08/07 MySQL