Ways to Count English Word Occurrences in a Plain-Text File with Python [Tested and Working]


Posted in Python on July 25, 2018

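This article shows, by example, several ways to count how often each English word occurs in a plain-text file with Python. They are shared here for reference: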

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []
  words_dict = {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # any other character (punctuation) is simply skipped
# handle the last word, in case the file does not end with whitespace
if word:
  word = ''.join(word).lower()  # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
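
The character scan in Version 1 repeats the counting code once more for the final word. Below is a minimal sketch of one way to avoid that duplication, assuming the text is already in memory. The function name `count_words_charwise` is hypothetical and not part of the original code.

```python
def count_words_charwise(text):
    """Count words by scanning characters, following the Version 1 approach."""
    counts = {}
    letters = []
    # Appending one space guarantees the final word is flushed inside the loop.
    for ch in text + ' ':
        if ch.isalnum():
            letters.append(ch)
        elif ch.isspace() and letters:
            word = ''.join(letters).lower()
            counts[word] = counts.get(word, 0) + 1
            letters = []
        # Any other character (punctuation) is skipped, as in Version 1.
    return counts


sample = "We are busy all day, we are."
print(count_words_charwise(sample))
```

`dict.get(word, 0)` replaces the `if word not in words_dict` branch, so each word is handled in one line.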

Version 2:

Drawback: the whole file is read into memory at once, so performance is poor on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # convert to lowercase
  word_set = set(word_list)  # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # concise form of the loop above
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
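
Besides the memory cost, calling `word_list.count(word)` for every distinct word rescans the whole list each time. Below is a minimal sketch of a single-pass alternative, assuming the same `\w+` tokenization as Version 2. The helper name `count_words_regex` is hypothetical.

```python
import re

WORD_RE = re.compile(r'\w+')


def count_words_regex(text):
    """Count lowercase word tokens in a single pass over the regex matches."""
    counts = {}
    for match in WORD_RE.finditer(text):
        word = match.group().lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_regex("The soul of the soul."))
# {'the': 2, 'soul': 2, 'of': 1}
```

`finditer` yields matches lazily, so no intermediate list of all words is built either.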

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but split() keeps punctuation and case,
    # so 'away,' and 'away' (or 'We' and 'we') are counted separately
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
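
The output above keeps punctuation and case ('away,' and 'away', 'We' and 'we') because `split()` only separates on whitespace. Below is a minimal sketch of one way to normalize the tokens while still reading line by line. The names `normalize` and `count_words_split` are hypothetical, and stripping only ASCII punctuation is an assumption that suits this English test text.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


def count_words_split(lines):
    """Count normalized whitespace-separated tokens from an iterable of lines."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that consisted of punctuation only
                counts[word] = counts.get(word, 0) + 1
    return counts


lines = ["As time goes by, childhood away,\n", "years away - we grew up.\n"]
print(count_words_split(lines))
```

Since `count_words_split` accepts any iterable of lines, an open text file can be passed in directly. Unlike Version 1, punctuation inside a token (such as an apostrophe) is kept.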

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split() keeps punctuation and case, as in Version 3
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
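
Version 4 still collects every word in `word_list` before counting, and its output keeps punctuation and case. Below is a minimal sketch that combines the ideas from the earlier versions: it feeds a `Counter` one line at a time and tokenizes with the `\w+` regex from Version 2. The function name `count_words` is hypothetical, and `io.StringIO` stands in for the opened file only so the sketch runs on its own.

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')


def count_words(lines):
    """Return a Counter of lowercase word tokens, updated one line at a time."""
    counter = collections.Counter()
    for line in lines:
        counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter


# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO("We are busy all day,\nwe grew up, we have lost themselves.\n")
counter = count_words(fake_file)
for word, count in counter.most_common(3):
    print(word, count)
```

`Counter.most_common(n)` returns the `n` most frequent words, which is often what is wanted from a word count and saves sorting the dictionary by hand.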

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
