This article walks through several ways to count how often each English word appears in a plain-text file using Python. The details are as follows:
Version 1: inefficient
# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # convert to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
    # handle the last word, which may not be followed by whitespace
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
for k, v in words_dict.items():
    print(k, v)
Output:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
Version 2:
Drawback: the whole file is read into memory at once, so it performs poorly on large files.
# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()
    word_reg = re.compile(r'\w+')
    #word_reg = re.compile(r'\w+\b')
    word_list = word_reg.findall(data)
    word_list = [word.lower() for word in word_list]  # convert to lowercase
    word_set = set(word_list)  # avoid counting the same word more than once
    # words_dict = {}
    # for word in word_set:
    #     words_dict[word] = word_list.count(word)
    # concise form of the loop above
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)
Output:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
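Besides reading the whole file at once, Version 2 calls `word_list.count(word)` once per distinct word, and each call rescans the entire list. A minimal sketch of a single-pass alternative follows, assuming the same `\w+` tokenization; the function name `count_words_single_pass` and the sample sentence are illustrative and not part of the original code:

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r'\w+')

def count_words_single_pass(text):
    """Count lower-cased words in text with one pass over the tokens."""
    counts = defaultdict(int)  # missing keys start at 0
    for match in WORD_RE.finditer(text):
        counts[match.group().lower()] += 1
    return dict(counts)

# Hypothetical sample sentence, used only for illustration
sample = "We have lost, we have found."
result = count_words_single_pass(sample)
print(result)  # {'we': 2, 'have': 2, 'lost': 1, 'found': 1}
```

Each token is looked up in the dictionary exactly once, so the work grows with the number of words rather than with the number of words times the number of distinct words.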
Version 3:
# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        #line_words = word_reg.findall(line)
        # simpler than the regex above
        line_words = line.split()
        word_list.extend(line_words)
    word_set = set(word_list)  # avoid counting the same word more than once
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)
Output (because split() only splits on whitespace and the words are not lowercased, punctuation stays attached and We/we are counted separately):
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
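The entries such as `by,` and `We` in the output above come from the choice of tokenizer. As a small illustration, the sketch below runs both approaches on one sentence fragment taken from the test text; the variable names are illustrative:

```python
import re

line = "As time goes by, childhood away, we grew up."

# str.split() splits on whitespace only, so punctuation stays attached
split_tokens = line.split()
print(split_tokens)
# ['As', 'time', 'goes', 'by,', 'childhood', 'away,', 'we', 'grew', 'up.']

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped;
# lower() folds "As" and "as" into the same key
regex_tokens = re.findall(r'\w+', line.lower())
print(regex_tokens)
# ['as', 'time', 'goes', 'by', 'childhood', 'away', 'we', 'grew', 'up']
```

Which behavior is preferable depends on the task: `split()` is simpler, while the regular expression gives counts that match Versions 1 and 2.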
Version 4: counting with Counter
# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        line_words = line.split()
        word_list.extend(line_words)
    words_dict = dict(collections.Counter(word_list))  # count with Counter
for k, v in words_dict.items():
    print(k, v)
Output:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
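The strengths of the versions above can be combined: read line by line, tokenize with `\w+`, lowercase, and let `Counter` do the counting without building an intermediate list of all words. The following is a sketch under those assumptions rather than a definitive implementation; `count_words_in_lines` is a hypothetical helper name, and `io.StringIO` stands in for a real file so the example runs on its own:

```python
import io
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')

def count_words_in_lines(lines):
    """Count lower-cased words from any iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Counter.update adds one to the count of each item in the iterable
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# io.StringIO stands in for a file here; a file object returned by
# open(path, encoding='utf-8') can be passed in the same way.
fake_file = io.StringIO("We are busy all day,\nwe grew up, years away.\n")
counts = count_words_in_lines(fake_file)
print(counts['we'])   # 2
print(counts['day'])  # 1
```

Only one line is held in memory at a time, which addresses the large-file drawback noted for Version 2.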
Note: the test text test.txt used here is as follows:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
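All four versions print the counts in arbitrary or first-seen order. If the most frequent words are what matters, `Counter.most_common` sorts by count. The sketch below embeds the test text as a string instead of reading test.txt, purely so that it is self-contained:

```python
import re
from collections import Counter

text = (
    "We are busy all day, like swarms of flies without souls, noisy, "
    "restless, unable to hear the voices of the soul. As time goes by, "
    "childhood away, we grew up, years away a lot of memories, once have "
    "also eroded the bottom of the childish innocence, we regardless of "
    "the shackles of mind, indulge in the world buckish, focus on the "
    "beneficial principle, we have lost themselves."
)

counts = Counter(re.findall(r'\w+', text.lower()))
top_three = counts.most_common(3)
print(top_three)  # [('the', 7), ('of', 6), ('we', 4)]
```

These figures agree with the counts printed by Versions 1 and 2.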