
Downloading WeChat Official Account Articles with Python

Posted in Python on February 26, 2019

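This article shares working Python code for downloading articles from a WeChat official account, for your reference. The details follow.

Goal: download the "pytest" series of documents published by the official account 从零开始学自动化测试.

1. Search Sogou's WeChat article search by keyword.

2. Parse the first N pages of search results to get each article's title and URL.

The code mainly relies on requests and BeautifulSoup from bs4.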
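
Step 1 can be sketched as a small helper that builds one Sogou search URL per results page. The function name `build_search_urls` is hypothetical; the URL template is copied from Weixin.py, and nothing here touches the network.

```python
from urllib.parse import quote

# URL template taken from the article's Weixin.py.
SEARCH_URL = ("https://weixin.sogou.com/weixin?query=%s&_sug_type_=&s_from=input"
              "&_sug_=n&type=2&page=%s&ie=utf8")


def build_search_urls(query, pages):
    """Return the search URLs for result pages 1..pages (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII text such as Chinese.
    return [SEARCH_URL % (quote(query), page) for page in range(1, pages + 1)]


urls = build_search_urls("pytest文档", 2)
for url in urls:
    print(url)
```

Keeping URL construction separate from the HTTP request makes it easy to inspect the URLs before sending anything to the search site.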

Weixin.py

import requests
from urllib.parse import quote
from bs4 import BeautifulSoup
import re
from WeixinSpider.HTML2doc import MyHTMLParser
 
class WeixinSpider(object):
 
 def __init__(self, gzh_name, pageno,keyword):
  self.GZH_Name = gzh_name
  self.pageno = pageno
  self.keyword = keyword.lower()
  self.page_url = []
  self.article_list = []
  self.headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
  self.timeout = 5
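  # [...] matches any single character from the set, e.g. [amk] matches 'a', 'm' or 'k'.
  # A trailing + matches one or more repetitions.
  # The pattern below strips characters that are not allowed in Windows file names.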
  self.pattern = r'[\\/:*?"<>|\r\n]+'
 
 def get_page_url(self):
  for i in range(1,self.pageno+1):
   # https://weixin.sogou.com/weixin?query=从零开始学自动化测试&_sug_type_=&s_from=input&_sug_=n&type=2&page=2&ie=utf8
   url = "https://weixin.sogou.com/weixin?query=%s&_sug_type_=&s_from=input&_sug_=n&type=2&page=%s&ie=utf8" \
     % (quote(self.GZH_Name),i)
   self.page_url.append(url)
 
 def get_article_url(self):
  for url in self.page_url:
   response = requests.get(url,headers=self.headers,timeout=self.timeout)
   result = BeautifulSoup(response.text, 'html.parser')
   articles = result.select('ul[class="news-list"] > li > div[class="txt-box"] > h3 > a')
   for a in articles:
    if self.keyword in a.text.lower():
     # Strip characters that are invalid in file names, then store one
     # {title: url} dict per match. Sharing a single dict across all
     # matches would make every article download repeatedly.
     new_text = re.sub(self.pattern,"",a.text)
     self.article_list.append({new_text: a["href"]})
 
 
 
headers = {'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
timeout = 5
gzh_name = 'pytest文档'
My_GZH = WeixinSpider(gzh_name,5,'pytest')
My_GZH.get_page_url()
# print(My_GZH.page_url)
My_GZH.get_article_url()
# print(My_GZH.article_list)
for article in My_GZH.article_list:
 for (key,value) in article.items():
  url=value
  html_response = requests.get(url,headers=headers,timeout=timeout)
  myHTMLParser = MyHTMLParser(key)
  myHTMLParser.feed(html_response.text)
  myHTMLParser.doc.save(myHTMLParser.docfile)
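
The keyword filter and title clean-up inside get_article_url can be tried without any network access. The sketch below assumes the search results have already been reduced to (title, href) pairs; `filter_articles` and the sample data are hypothetical.

```python
import re

# Same character class as self.pattern in Weixin.py: characters that Windows
# does not allow in file names, plus line breaks.
INVALID_CHARS = r'[\\/:*?"<>|\r\n]+'


def filter_articles(links, keyword):
    """Keep links whose title contains keyword; return (clean_title, href) pairs."""
    keyword = keyword.lower()
    selected = []
    for title, href in links:
        if keyword in title.lower():
            selected.append((re.sub(INVALID_CHARS, "", title), href))
    return selected


# Made-up sample data standing in for parsed search results.
sample = [
    ("Pytest文档1: 环境准备?", "https://example.com/a"),
    ("selenium入门", "https://example.com/b"),
]
result = filter_articles(sample, "pytest")
print(result)
```

Only ASCII punctuation in the character class is removed; full-width characters in a title are left untouched.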

HTML2doc.py

from html.parser import HTMLParser
import requests
from docx import Document
import re
from docx.shared import RGBColor
import docx
 
 
class MyHTMLParser(HTMLParser):
 def __init__(self,docname):
  HTMLParser.__init__(self)
  self.docname=docname
  # python-docx writes the .docx (Office Open XML) format, so use the matching extension.
  self.docfile = r"D:\pytest\%s.docx"%self.docname
  self.doc=Document()
  self.title = False
  self.code = False
  self.text=''
  self.processing =None
  self.codeprocessing =None
  self.picindex = 1
  self.headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
  self.timeout = 5
 
 def handle_startendtag(self, tag, attrs):
  # Images take more work: find the image URL, download the file, then add it to the document.
  if tag == "img":
   attr_map = dict(attrs)
   pic_url = attr_map.get("data-src")
   pic_type = attr_map.get("data-type")
   # Skip images that lack a source URL or a file type; otherwise
   # picname or picdata would be undefined further down.
   if not pic_url or not pic_type:
    return
   picname = r"D:\pytest\%s%s.%s" % (self.docname, self.picindex, pic_type)
   picdata = requests.get(pic_url, headers=self.headers, timeout=self.timeout)
   self.picindex = self.picindex + 1
   with open(picname, "wb") as pic:
    pic.write(picdata.content)
   try:
    self.doc.add_picture(picname)
   except docx.image.exceptions.UnexpectedEndOfFileError as e:
    print(e)
 
 
 def handle_starttag(self, tag, attrs):
  if re.match(r"h(\d)", tag):
   self.title = True
  if tag =="p":
   self.processing = tag
  if tag == "code":
   self.code = True
   self.codeprocessing = tag
 
 def handle_data(self, data):
  if self.title:
   self.doc.add_heading(data, level=2)
  if self.processing:
   self.text = self.text + data
  if self.code:
   p = self.doc.add_paragraph()
   run = p.add_run(data)
   run.font.color.rgb = RGBColor(111,111,111)
 
 def handle_endtag(self, tag):
  self.title = False
  # self.code = False
  if tag == self.processing:
   self.doc.add_paragraph(self.text)
 
   self.processing = None
   self.text=''
  if tag == self.codeprocessing:
   self.code =False
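
The image handling in handle_startendtag relies on two attributes of the img tags: data-src for the image URL and data-type for the file extension. A minimal offline sketch of just that extraction step follows; the class name `ImageCollector` and the sample HTML are made up for illustration, and no image is downloaded.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (url, file_type) pairs from self-closing <img .../> tags."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_startendtag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict makes lookups simpler.
        if tag != "img":
            return
        attr_map = dict(attrs)
        url = attr_map.get("data-src")
        file_type = attr_map.get("data-type")
        # Skip images without both attributes instead of failing later.
        if url and file_type:
            self.images.append((url, file_type))


# Made-up HTML fragment; the second image has neither attribute.
html = (
    '<p>intro</p>'
    '<img data-type="png" data-src="https://example.com/1.png" />'
    '<img class="icon" />'
)
collector = ImageCollector()
collector.feed(html)
collector.close()
print(collector.images)
```

Collecting the pairs first and downloading afterwards keeps the parser free of network calls, which makes it easier to test.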

Result:


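Some documents are missing, for example pytest文档4, because they do not appear in Sogou's WeChat article search results in the first place.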
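
One way to see which parts of a numbered series never showed up in the search results is to compare the numbers found in the saved titles against the expected range. This is a sketch only: it assumes titles follow a pattern like pytest文档N, and both `missing_numbers` and the sample titles are hypothetical.

```python
import re


def missing_numbers(titles, prefix, last):
    """Return the numbers in 1..last that no title of the form prefix+number covers."""
    found = set()
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    for title in titles:
        match = pattern.search(title)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, last + 1) if n not in found]


# Made-up titles standing in for the keys collected in article_list.
titles = [
    "pytest文档1-环境准备",
    "pytest文档2-用例运行规则",
    "pytest文档3-测试报告",
    "pytest文档5-fixture",
]
print(missing_numbers(titles, "pytest文档", 5))
```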

That is all for this article. I hope it helps with your studies, and thank you for supporting 三水点靠木.


- Author -

qd_tudou
