A Summary of Ways to Count English Word Occurrences in a Plain-Text File with Python [Tested]


Posted in Python on July 25, 2018

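This article walks through several ways to count how many times each English word appears in a plain-text file using Python. They are shared here for reference: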

Version 1: inefficient (scans the text one character at a time)

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict = {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
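
The character-by-character logic of the first version can be wrapped in a function so that the "last word" handling is not duplicated. The following is only a minimal sketch: the names `count_words_by_char` and `flush` are hypothetical and do not appear in the original code, and the function takes a string instead of reading `test.txt` itself.

```python
def count_words_by_char(text):
    """Count lowercase alphanumeric words by scanning one character at a time."""
    counts = {}
    current = []  # characters of the word currently being built

    def flush():
        # Record the word collected so far (if any), then start a new one.
        if current:
            word = ''.join(current).lower()
            counts[word] = counts.get(word, 0) + 1
            current.clear()

    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif ch.isspace():
            flush()
        # Any other character (punctuation) is skipped, as in version 1.
    flush()  # handle the last word
    return counts
```

Because punctuation is skipped without ending the word, this approach (like version 1) joins letters on both sides of a punctuation mark when no whitespace separates them, so "don't" is counted as "dont".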

Version 2:

Drawback: the whole file is read into memory at once, so it performs poorly on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
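
Besides reading the whole file at once, the second version calls `word_list.count(word)` once per distinct word, and each call rescans the entire list. One possible way around both issues is to tally words in a single pass while iterating line by line. This is a sketch under assumptions, not the author's method: `count_words_streaming` and `WORD_RE` are hypothetical names, and the function accepts any iterable of lines so that an open file object can be passed directly.

```python
import re

WORD_RE = re.compile(r'\w+')  # same pattern as version 2


def count_words_streaming(lines):
    """Tally lowercase words in one pass over an iterable of lines."""
    counts = {}
    for line in lines:
        for word in WORD_RE.findall(line):
            word = word.lower()
            counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == '__main__':
    # Usage with the article's test file; only one line is held in memory at a time.
    with open('test.txt', 'r', encoding='utf-8') as f:
        for k, v in count_words_streaming(f).items():
            print(k, v)
```

Note that `\w` also matches digits and underscores, so tokens such as `file_1` are counted as single words.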

Version 3: read the file line by line

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but punctuation stays attached
    # to the words and case is preserved (see the output below)
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
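
In the output above, `split()` leaves punctuation attached ("away," and "away" are counted separately) and keeps case ("We" and "we" are distinct). If that is not wanted, one option is to normalize each token before counting. The sketch below assumes ASCII punctuation only; `normalize` and `count_split_words` are hypothetical helper names that are not part of the original code.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and convert to lowercase."""
    return token.strip(string.punctuation).lower()


def count_split_words(lines):
    """Count whitespace-separated tokens after normalizing each one."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were punctuation only
                counts[word] = counts.get(word, 0) + 1
    return counts
```

Since `str.strip` only touches the ends of a token, inner punctuation such as the apostrophe in "don't" is preserved, unlike in version 1.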

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split on whitespace; punctuation and case are kept as-is
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
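
`Counter` can also be updated incrementally, which avoids building the intermediate `word_list`, and its `most_common()` method returns the entries sorted by count. The following sketch combines it with the `\w+` pattern and lowercasing from version 2; `top_words` and `WORD_RE` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')


def top_words(lines, n=3):
    """Return the n most frequent lowercase words from an iterable of lines."""
    counter = Counter()
    for line in lines:
        # update() adds to the existing counts, one line at a time
        counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter.most_common(n)


if __name__ == '__main__':
    with open('test.txt', 'r', encoding='utf-8') as f:
        for word, count in top_words(f):
            print(word, count)
```

For the test text used in this article, the three most frequent words are "the" (7), "of" (6) and "we" (4), matching the counts printed by versions 1 and 2.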

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
