编程 Python

Python爬虫爬取新闻资讯案例详解

Posted in Python onJuly 14, 2020

前言

本文的文字及图片来源于网络,仅供学习、交流使用,不具有任何商业用途,版权归原作者所有,如有问题请及时联系我们以作处理。

一个简单的Python资讯采集案例，列表页到详情页，到数据保存，保存为txt文档，网站网页结构算是比较规整，简单清晰明了，资讯新闻内容的采集和保存！

Python爬虫爬取新闻资讯案例详解

应用到的库

requests，time，re，UserAgent，etree

import requests,time,re
from fake_useragent import UserAgent
from lxml import etree

列表页面

Python爬虫爬取新闻资讯案例详解

列表页，链接xpath解析

href_list=req.xpath('//ul[@class="news-list"]/li/a/@href')

详情页

Python爬虫爬取新闻资讯案例详解

内容xpath解析

h2=req.xpath('//div[@class="title-box"]/h2/text()')[0]
author=req.xpath('//div[@class="title-box"]/span[@class="news-from"]/text()')[0]
details=req.xpath('//div[@class="content-l detail"]/p/text()')

内容格式化处理

detail='\n'.join(details)

标题格式化处理，替换非法字符

pattern = r"[\/\\\:\*\?\"\<\>\|]"
new_title = re.sub(pattern, "_", title) # 替换为下划线

保存数据，保存为txt文本

def save(self,h2, author, detail):
with open(f'{h2}.txt','w',encoding='utf-8') as f:
f.write('%s%s%s%s%s'%(h2,'\n',detail,'\n',author))

print(f"保存{h2}.txt文本成功！")

遍历数据采集，yield处理

def get_tasks(self):
data_list = self.parse_home_list(self.url)
for item in data_list:
yield item

程序运行效果

Python爬虫爬取新闻资讯案例详解

程序采集效果

Python爬虫爬取新闻资讯案例详解

附源码参考：

# -*- coding: UTF-8 -*-

import requests,time,re
from fake_useragent import UserAgent
from lxml import etree

class RandomHeaders(object):
  ua=UserAgent()
  @property
  def random_headers(self):
    return {
      'User-Agent': self.ua.random,
    }

class Spider(RandomHeaders):
  def __init__(self,url):
    self.url=url


  def parse_home_list(self,url):
    response=requests.get(url,headers=self.random_headers).content.decode('utf-8')
    req=etree.HTML(response)
    href_list=req.xpath('//ul[@class="news-list"]/li/a/@href')
    print(href_list)
    for href in href_list:
      item = self.parse_detail(f'https://yz.chsi.com.cn{href}')
      yield item


  def parse_detail(self,url):
    print(f">>正在爬取{url}")
    try:
      response = requests.get(url, headers=self.random_headers).content.decode('utf-8')
      time.sleep(2)
    except Exception as e:
      print(e.args)
      self.parse_detail(url)
    else:
      req = etree.HTML(response)
      try:
        h2=req.xpath('//div[@class="title-box"]/h2/text()')[0]
        h2=self.validate_title(h2)
        author=req.xpath('//div[@class="title-box"]/span[@class="news-from"]/text()')[0]
        details=req.xpath('//div[@class="content-l detail"]/p/text()')
        detail='\n'.join(details)
        print(h2, author, detail)
        self.save(h2, author, detail)
        return h2, author, detail
      except IndexError:
        print(">>>采集出错需延时，5s后重试..")
        time.sleep(5)
        self.parse_detail(url)


  @staticmethod
  def validate_title(title):
    pattern = r"[\/\\\:\*\?\"\<\>\|]"
    new_title = re.sub(pattern, "_", title) # 替换为下划线
    return new_title



  def save(self,h2, author, detail):
    with open(f'{h2}.txt','w',encoding='utf-8') as f:
      f.write('%s%s%s%s%s'%(h2,'\n',detail,'\n',author))

    print(f"保存{h2}.txt文本成功！")
  def get_tasks(self):
    data_list = self.parse_home_list(self.url)
    for item in data_list:
      yield item
if __name__=="__main__":
  url="https://yz.chsi.com.cn/kyzx/jyxd/"
  spider=Spider(url)
  for data in spider.get_tasks():
    print(data)

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持三水点靠木。

Python爬虫爬取新闻资讯案例详解

- Author -

吃着东西不想停

声明：登载此文出于传递更多信息之目的，并不意味着赞同其观点或证实其描述。

Python 相关文章推荐

Python随手笔记之标准类型内建函数

Dec 02 Python

python+selenium实现163邮箱自动登陆的方法

Dec 31 Python

对pandas将dataframe中某列按照条件赋值的实例讲解

Nov 29 Python

基于python if 判断选择结构的实例详解

May 06 Python

Python实现的栈、队列、文件目录遍历操作示例

May 06 Python

python实现爬取百度图片的方法示例

Jul 06 Python

python飞机大战pygame游戏背景设计详解

Dec 17 Python

Django更新models数据库结构步骤

Apr 01 Python

python安装cx_Oracle和wxPython的方法

Sep 14 Python

Python list去重且保持原顺序不变的方法

Apr 03 Python

python实现大文本文件分割成多个小文件

Apr 20 Python

Python实现Matplotlib,Seaborn动态数据图

May 06 Python

Win10下配置tensorflow-gpu的详细教程（无VS2015/2017）

Jul 14 #Python

Python实现图片查找轮廓、多边形拟合、最小外接矩形代码

Jul 14 #Python

python操作微信自动发消息的实现(微信聊天机器人)

Jul 14 #Python

python如何写try语句

Jul 14 #Python

Python操作MySQL数据库的示例代码

Jul 13 #Python

Python基于正则表达式实现计算器功能

Jul 13 #Python

python输出结果刷新及进度条的实现操作

Jul 13 #Python