
How to extract text from HTML in Python

Posted in Python on May 20, 2021

When working on NLP problems, you sometimes need a large corpus of text. The internet is the largest source of text, but unfortunately, extracting text from arbitrary HTML pages is a hard and painful task.
Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default is the get_text method from the BeautifulSoup package, here configured to use lxml as its parser. It is a well-tested solution, but it can be very slow when you process thousands upon thousands of HTML documents.
By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):

# coding: utf-8

from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


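def get_text_bs(html):
    """Extract the text of <body> with BeautifulSoup (lxml parser)."""
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    # Drop <script> and <style> elements so their contents are not returned as text.
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text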


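def get_text_selectolax(html):
    """Extract the text of <body> with selectolax."""
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    # Drop <script> and <style> elements so their contents are not returned as text.
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text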


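def read_doc(record, parser=get_text_selectolax):
    """Return (url, text) for one WARC record; text is None if nothing was parsed."""
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        # The payload is the raw HTTP message: headers, a blank line, then the body.
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text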


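def process_warc(file_name, parser, limit=10000):
    """Parse records from a WARC file with `parser` and print the elapsed time."""
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        # Skip records without a URL or without extractable text.
        if not doc or not url:
            continue

        n_documents += 1

        # `limit` is compared with the record index, not with n_documents.
        if i > limit:
            break

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))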

>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

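Clearly, this is not the most rigorous way to benchmark something, but it gives an idea: selectolax can be far faster than BeautifulSoup with lxml, about 27 times faster in this run (16.2 seconds versus 432.7 seconds), in line with the claim of up to 30x.
selectolax is best suited to stripping HTML down to plain text. Suppose I have more than 10,000 HTML snippets that need to be indexed into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale is actually quite inefficient. So what is the most efficient way to do it?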

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expressions

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

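Results

I wrote a script to measure the time; it iterates over 10,000 files containing HTML snippets. Note! These snippets are not complete <html> documents (with <head>, <body>, and so on), just small blobs of HTML. The average size is 10,314 bytes (median 5,138 bytes). The results are as follows: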

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

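I ran this many times and the results are very stable. The key point: selectolax is about 6 to 7 times faster than PyQuery (3.08 seconds versus 18.61 seconds in total; the medians differ by a factor of about 6.5).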
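The timing script itself is not shown in the article. A minimal sketch of how such per-document timings and the SUM/MEAN/MEDIAN summary could be collected is given below. The helper names `time_parser` and `report` and the in-memory sample documents are assumptions for illustration, and only the regex variant is timed so that the sketch needs nothing beyond the standard library.

```python
import re
import statistics
import time

clean_regex = re.compile(r'<.*?>')


def strip_regex(markup):
    """Naive tag stripper, same pattern as the regex approach above."""
    return clean_regex.sub('', markup)


def time_parser(func, documents):
    """Return one timing in milliseconds per document."""
    timings = []
    for doc in documents:
        t0 = time.perf_counter()
        func(doc)
        timings.append((time.perf_counter() - t0) * 1000)
    return timings


def report(name, timings):
    """Print a summary in the same layout as the results above."""
    print(name)
    print('  SUM:    %.2f seconds' % (sum(timings) / 1000))
    print('  MEAN:   %.4f ms' % statistics.mean(timings))
    print('  MEDIAN: %.4f ms' % statistics.median(timings))


# Stand-in for the 10,000 snippet files used in the article (assumed data).
documents = ['<p>Hello <b>world</b> number %d</p>' % i for i in range(1000)]
report('regex', time_parser(strip_regex, documents))
```

To compare libraries, the same `time_parser` call could be repeated with a PyQuery-based or selectolax-based function in place of `strip_regex`.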

Regular expressions work? Really?

For the most basic HTML blobs, it probably works fine. But if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
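To illustrate the entity problem, here is a small sketch using only the standard library. The helper name `strip_tags_regex` is hypothetical, and `html.unescape` is shown as one possible way to decode entities after stripping tags, not as something the article's benchmark did.

```python
import html
import re

# Same naive pattern as the regex approach above.
clean_regex = re.compile(r'<.*?>')


def strip_tags_regex(markup):
    """Remove anything that looks like a tag; entities are left untouched."""
    return clean_regex.sub('', markup)


raw = '<p>Foo &amp; Bar</p>'
stripped = strip_tags_regex(raw)
print(stripped)                 # Foo &amp; Bar
print(html.unescape(stripped))  # Foo & Bar
```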
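More importantly, PyQuery and selectolax support something very specific that matters for my use case: before going any further, I need to remove certain tags (and their content). For example: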

<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

A tag-stripping regular expression cannot reliably do this.
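As a quick check of that claim, the sketch below runs the naive pattern from the regex approach over the three-line example. It only demonstrates this particular pattern; the variable names are illustrative.

```python
import re

clean_regex = re.compile(r'<.*?>')

snippet = (
    '<h4 class="warning">This should get stripped.</h4>\n'
    '<p>Please keep.</p>\n'
    '<div style="display: none">This should also get stripped.</div>'
)

# The pattern deletes the tags themselves but keeps everything between them,
# so the text of the warning and hidden elements survives.
text = clean_regex.sub('', snippet)
print(text)
```

All three sentences remain in the output, including the two that were supposed to disappear.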

Version 2.0

My requirements may change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden">, and <div style="display: none">. So let's implement that:

PyQuery

import re

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark over the 10,000 snippets, the new results are as follows:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of about 6.

Conclusion

Regular expressions are fast but weak in functionality. The efficiency of selectolax is impressive.


Author: Python中文社区
