python实现知乎高颜值图片爬取


Posted in Python onAugust 12, 2019

导入相关包

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

百度云 人脸检测 申请信息

#唯一必须填的信息就这三行
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# 过滤颜值阈值,存储空间大的请随意
BEAUTY_THRESHOLD = 55
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"
# 如果权限错误,浏览器中打开知乎,在开发者工具复制一个,无需登录
# 建议最好换一个,因为不知道知乎的反爬虫策略,如果太多人用同一个,可能会影响程序运行

以下皆无需改动

# 每次请求知乎的讨论列表长度,不建议设定太长,注意节操
LIMIT = 5
# 这是话题『美女』的 ID,其是『颜值』(20013528)的父话题
SOURCE = "19552207"

爬虫假装下正常浏览器请求

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# 某话题下讨论列表请求 url
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# 初始请求 url 附带的请求参数
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION

指定 url,获取对应原始内容 / 图片

def fetch_image(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.content

指定 url,获取对应 JSON 返回 / 话题列表

def fetch_activities(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.json()

处理返回的话题列表

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
      path = Path(__file__).parent.joinpath("image").joinpath(name)
      try:
        f = open(path, "wb")
        f.write(image_data)
        f.flush()
        f.close()
        print(path)
        time.sleep(2)
      except Exception as e:
        continue
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None

初始化颜值检测工具

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(str(image), "BASE64", options)
    response = response.get("result")
    if not response:
      return
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    if pydash.get(face_list, "0.face_probability") < 0.6:
      return
    if pydash.get(face_list, "0.beauty") < BEAUTY_THRESHOLD:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    score = pydash.get(face_list, "0.beauty")
    return score
  return detective

程序入口

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)
if __name__ == '__main__':
  main()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python代码的打包与发布详解
Jul 30 Python
关于你不想知道的所有Python3 unicode特性
Nov 28 Python
Python中的__SLOTS__属性使用示例
Feb 18 Python
Python命令行参数解析模块optparse使用实例
Apr 13 Python
django 常用orm操作详解
Sep 13 Python
Python网络编程详解
Oct 31 Python
详解python分布式进程
Oct 08 Python
如何不用安装python就能在.NET里调用Python库
Jul 12 Python
python wav模块获取采样率 采样点声道量化位数(实例代码)
Jan 22 Python
Django接收照片储存文件的实例代码
Mar 07 Python
Django如何批量创建Model
Sep 01 Python
python实现监听键盘
Apr 26 Python
python3 enum模块的应用实例详解
Aug 12 #Python
Python一键查找iOS项目中未使用的图片、音频、视频资源
Aug 12 #Python
django+echart数据动态显示的例子
Aug 12 #Python
Flask框架学习笔记之使用Flask实现表单开发详解
Aug 12 #Python
Flask框架学习笔记之表单基础介绍与表单提交方式
Aug 12 #Python
python内存管理机制原理详解
Aug 12 #Python
Flask框架学习笔记之路由和反向路由详解【图文与实例】
Aug 12 #Python
You might like
php 调用远程url的六种方法小结
2009/11/02 PHP
PHP错误抑制符(@)导致引用传参失败Bug的分析
2011/05/02 PHP
php带抄送和密件抄送的邮件发送方法
2015/03/20 PHP
php计算年龄精准到年月日
2015/11/17 PHP
PHP传值到不同页面的三种常见方式及php和html之间传值问题
2015/11/19 PHP
php微信公众账号开发之前五个坑(一)
2016/09/18 PHP
PHP7 windows支持
2021/03/09 PHP
用js模拟JQuery的show与hide动画函数代码
2010/09/20 Javascript
js调用activeX获取u盘序列号的代码
2011/11/21 Javascript
Javascript 判断是否存在函数的方法
2013/01/03 Javascript
js格式化货币数据实现代码
2013/09/04 Javascript
js基础知识(公有方法、私有方法、特权方法)
2015/11/06 Javascript
AngularJS仿苹果滑屏删除控件
2016/01/18 Javascript
node.js中axios使用心得总结
2017/11/29 Javascript
Vue.js 2.0和Cordova开发webApp环境搭建方法
2018/02/26 Javascript
layui加载表格,绑定新增,编辑删除,查看按钮事件的例子
2019/09/06 Javascript
微信小程序 下拉刷新及上拉加载原理解析
2019/11/06 Javascript
实例讲解React 组件
2020/07/07 Javascript
[42:36]DOTA2上海特级锦标赛B组败者赛 VG VS Spirit第二局
2016/02/26 DOTA
分析Python读取文件时的路径问题
2018/02/11 Python
python实现美团订单推送到测试环境,提供便利操作示例
2019/08/09 Python
python 实现方阵的对角线遍历示例
2019/11/29 Python
python解压zip包中文乱码解决方法
2020/11/27 Python
Python爬虫模拟登陆哔哩哔哩(bilibili)并突破点选验证码功能
2020/12/21 Python
Spartoo英国:欧洲最大的网上鞋店
2016/09/13 全球购物
Casadei卡萨蒂官网:意大利奢侈鞋履品牌
2017/10/28 全球购物
同步和异步有何异同,在什么情况下分别使用他们
2013/04/09 面试题
绩效专员岗位职责
2013/12/02 职场文书
党员一句话承诺大全
2014/03/28 职场文书
关爱留守儿童标语
2014/06/18 职场文书
银行贷款收入证明
2014/10/17 职场文书
车间统计员岗位职责
2015/04/14 职场文书
2019个人工作计划书的格式及范文!
2019/07/04 职场文书
SQLServer2019 数据库环境搭建与使用的实现
2021/04/08 SQL Server
Pytorch可视化的几种实现方法
2021/06/10 Python
一篇文章带你掌握SQLite3基本用法
2022/06/14 数据库