Counting English Word Occurrences in a Plain Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

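This article walks through several ways to count how many times each English word occurs in a plain text file using Python. They are shared here for reference; the details are as follows: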

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []        # characters of the word currently being read
  words_dict = {}  # word -> number of occurrences
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # any other character (punctuation) is simply skipped
# handle the last word, in case the file does not end with whitespace
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
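
The rule used by the first version (split on whitespace, drop every non-alphanumeric character, lowercase the rest) can also be written without the character-by-character state and without the separate "last word" step. The following is only a sketch of that idea; the function name count_words and the sample string are made up for illustration and are not part of the original code.

```python
def count_words(text):
    """Count words with the same rule as the first version (illustrative helper)."""
    counts = {}
    for token in text.split():  # split on any run of whitespace
        # Drop punctuation and other non-alphanumeric characters, then lowercase
        word = ''.join(ch for ch in token if ch.isalnum()).lower()
        if word:  # a token made only of punctuation yields an empty string
            counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample input, not the article's test file
sample = "We are busy all day, we grew up."
print(count_words(sample))
```

Because str.split() already handles the end of the text, no special handling of the final word is needed here.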

Version 2:

Drawback: the whole file is read into memory at once, so performance suffers on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
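
In the second version, word_list.count(word) rescans the whole list once for every distinct word. A possible alternative, shown here only as a sketch, is to increment a dictionary entry while walking the matches a single time. The names WORD_RE and count_words_single_pass are assumptions introduced for this example. Note that \w also matches digits and underscores, just as in the code above.

```python
import re

WORD_RE = re.compile(r'\w+')  # same pattern as the second version


def count_words_single_pass(text):
    """Count lowercase words in one pass over the regex matches (illustrative)."""
    counts = {}
    for match in WORD_RE.findall(text):
        word = match.lower()
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample input
print(count_words_single_pass("The cat and the hat"))
```

The counts are the same as with the set-plus-count approach; only the amount of work per word changes.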

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f: # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but keeps punctuation and letter case
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
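
As the output above shows, line.split() leaves punctuation attached ("away," and "away" are counted separately) and keeps letter case ("We" and "we" are counted separately). One way to merge such entries, sketched here under the assumption that only ASCII punctuation matters, is to strip punctuation from both ends of each token and lowercase it before counting. The helper names normalize and count_normalized are hypothetical.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase (illustrative)."""
    return token.strip(string.punctuation).lower()


def count_normalized(lines):
    """Count normalized words over an iterable of lines, e.g. an open file."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were punctuation only
                counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample lines
lines = ["We are busy all day,", "we grew up, years away, away"]
print(count_normalized(lines))
```

Unlike the first version, this keeps punctuation inside a word (for example an apostrophe) and removes it only at the edges.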
indulge 1

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    line_words = line.split() # keeps punctuation and letter case
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
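
The pieces from the versions above can be combined: read the file line by line, extract words with the \w+ pattern, lowercase them, and feed them straight into a Counter instead of building an intermediate list. This is a sketch rather than a definitive implementation; the function name count_file_words and its encoding parameter are assumptions made for this example.

```python
import collections
import re

WORD_RE = re.compile(r'\w+')


def count_file_words(path, encoding='utf-8'):
    """Return a Counter of lowercase words, reading the file line by line."""
    counter = collections.Counter()
    with open(path, 'r', encoding=encoding) as f:
        for line in f:
            # Counter.update accepts any iterable and adds 1 per item
            counter.update(w.lower() for w in WORD_RE.findall(line))
    return counter
```

Calling count_file_words('test.txt').most_common(3) would then list the three most frequent words; going by the counts shown for the second version, these should be "the" (7), "of" (6) and "we" (4).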

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
