Example: Scraping Baidu Wenku with Python

Posted in Python on February 16, 2019

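This article presents an example of scraping a Baidu Wenku document with Python, shared here for reference. The script drives Chrome through Selenium, parses the page source with BeautifulSoup, and writes the extracted text to a .docx file with python-docx. The details are as follows: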

# -*- coding: utf-8 -*-
from time import sleep

from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # used to center the title
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# Path to chromedriver (only needed when it is not on PATH)
# BROWSER_PATH = r'C:\Users\Administrator\AppData\Local\Google\Chrome\Application\chromedriver.exe'
# Target URL
DEST_URL = 'https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html'
# Holds the scraped document
doc_title = ''
doc_content_list = []


def find_doc(driver, init=True):
  global doc_content_list
  global doc_title
  stop_condition = False
  html = driver.page_source
  soup1 = BeautifulSoup(html, 'lxml')
  if init:  # first pass: read the title
    title_result = soup1.find('div', attrs={'class': 'doc-title'})
    doc_title = title_result.get_text()  # document title
    # Scroll to the fold-page element and click it
    init_page = driver.find_element(By.XPATH, "//div[@class='foldpagewg-text-con']")
    print(type(init_page), init_page)
    driver.execute_script('arguments[0].scrollIntoView();', init_page)
    init_page.click()
    init = False
  else:
    try:
      # Raises if the pager is no longer present, which ends the crawl
      driver.find_element(By.XPATH, "//div[@class='pagerwg-schedule']")
      next_page = driver.find_element(By.CLASS_NAME, "pagerwg-button")
      station = driver.find_element(By.XPATH, "//div[@class='bottombarwg-root border-none']")
      # Align the bottom bar with the bottom of the viewport, then click "next"
      driver.execute_script('arguments[0].scrollIntoView(false);', station)
      next_page.click()
    except WebDriverException:
      # Stop condition: an element is missing or cannot be clicked
      print("Element not found")
      stop_condition = True
  # Collect every <p class="txt"> paragraph from the snapshot taken above,
  # with spaces removed
  content_result = soup1.find_all('p', attrs={'class': 'txt'})
  for each in content_result:
    doc_content_list.append(each.get_text().replace(' ', ''))
  sleep(2)  # give a slow page time to load
  if not stop_condition:
    doc_title, doc_content_list = find_doc(driver, init)
  return doc_title, doc_content_list


def save(doc_title, doc_content_list):
  document = Document()
  heading = document.add_heading(doc_title, 0)
  heading.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title
  for each in doc_content_list:
    document.add_paragraph(each)
  # Use the first whitespace-separated token of the title in the file name
  t_title = doc_title.split()[0]
  filename = '百度文库-%s.docx' % t_title
  document.save(filename)
  print("\n\nCompleted: %s" % filename)


if __name__ == '__main__':
  options = webdriver.ChromeOptions()
  # Present a mobile user agent to the site
  options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"')
  # driver = webdriver.Chrome(BROWSER_PATH, options=options)  # Selenium 3 style explicit path
  driver = webdriver.Chrome(options=options)
  try:
    driver.get(DEST_URL)
    print("**********START**********")
    title, content = find_doc(driver, True)
    save(title, content)
  finally:
    driver.quit()  # always close the browser, even if scraping fails
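
The script above re-parses the full page source on every pass and calls itself once per page. If the page keeps earlier paragraphs in the DOM after more content loads (an assumption, not something verified here), the same text is appended repeatedly, and a long document could hit Python's recursion limit. The sketch below shows one possible loop-based alternative. The names `collect_pages`, `read_paragraphs`, `go_next`, and `max_pages` are hypothetical and not part of the original script; the two callbacks stand in for the BeautifulSoup parsing step and the Selenium "next page" click.

```python
from typing import Callable, Iterable, List


def collect_pages(read_paragraphs: Callable[[], Iterable[str]],
                  go_next: Callable[[], bool],
                  max_pages: int = 500) -> List[str]:
    """Gather paragraphs page by page with a loop instead of recursion.

    read_paragraphs() returns the paragraphs currently on the page.
    go_next() tries to load more and returns False when nothing is left.
    """
    seen = set()
    ordered = []
    for _ in range(max_pages):          # hard cap so a stuck pager cannot loop forever
        for text in read_paragraphs():
            if text not in seen:        # skip paragraphs collected on an earlier pass
                seen.add(text)
                ordered.append(text)
        if not go_next():
            break
    return ordered


if __name__ == "__main__":
    # Fake "page" whose content grows each time more is loaded.
    snapshots = [["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]]
    state = {"i": 0}

    def read():
        return snapshots[state["i"]]

    def advance():
        if state["i"] + 1 >= len(snapshots):
            return False
        state["i"] += 1
        return True

    print(collect_pages(read, advance))  # ['a', 'b', 'c', 'd']
```

Deduplicating by text is a trade-off: a paragraph that legitimately appears twice in the document is kept only once. Tracking how many paragraphs were already consumed would avoid that, but only works if the page never replaces earlier content.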

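The `save` function builds the output file name from the first whitespace-separated token of the scraped title. That raises `IndexError` on an empty title, and a title containing characters such as `/` or `:` can produce a name the file system rejects. One possible guard is sketched below; `safe_filename`, `fallback`, and `max_len` are hypothetical names introduced for illustration only.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title: str, fallback: str = "untitled", max_len: int = 80) -> str:
    """Turn a scraped title into a string that is safe to use in a file name."""
    parts = title.split()
    name = parts[0] if parts else fallback      # same "first token" rule as save()
    name = _ILLEGAL.sub("_", name).strip(". ")  # replace illegal chars, trim dots/spaces
    return name[:max_len] or fallback


if __name__ == "__main__":
    print(safe_filename("a/b:c rest"))  # a_b_c
    print(safe_filename("   "))         # untitled
```

With this helper, the file name in `save` could be built as `'百度文库-%s.docx' % safe_filename(doc_title)`.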

I hope this article is helpful for your Python programming.

- Author -

i_have_a_girlfriend
