- Python - 文本处理
- Python - 文本处理简介
- Python - 文本处理环境
- Python - 字符串不变性
- Python - 对行排序
- Python - 段落重新格式化
- Python - 段落中词语计数
- Python - 二进制ASCII转换
- Python - 字符串作为文件
- Python - 反向文件读取
- Python - 过滤重复单词
- Python - 从文本中提取电子邮件
- Python - 从文本中提取URL
- Python - 美化打印
- Python - 文本处理状态机
- Python - 大写和翻译
- Python - 分词
- Python - 删除停用词
- Python - 同义词和反义词
- Python - 文本翻译
- Python - 替换单词
- Python - 拼写检查
- Python - WordNet 接口
- Python - 语料库访问
- Python - 词性标注
- Python - 组块和组块间隙
- Python - 组块分类
- Python - 文本分类
- Python - 二元语法
- Python - 处理PDF
- Python - 处理Word文档
- Python - 读取RSS Feed
- Python - 情感分析
- Python - 搜索和匹配
- Python - 文本转换
- Python - 文本换行
- Python - 频率分布
- Python - 文本摘要
- Python - 词干提取算法
- Python - 受约束的搜索
Python - 段落中词语计数
在从源读取文本时,有时我们也需要找出有关所用单词类型的一些统计信息。这使得有必要统计单词的数量以及给定文本中具有特定类型单词的行数。在下面的示例中,我们展示了使用两种不同方法来统计段落中单词数量的程序。为此,我们考虑了一个文本文件,其中包含好莱坞电影的摘要。
读取文件
FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() print lines_in_file
当我们运行上述程序时,我们得到以下输出:
Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer. This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
使用nltk计数单词
接下来,我们使用nltk模块来统计文本中的单词。请注意,单词“(head)”被计为3个单词,而不是一个。
import nltk FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() nltk_tokens = nltk.word_tokenize(lines_in_file) print nltk_tokens print "\n" print "Number of Words: " , len(nltk_tokens)
当我们运行上述程序时,我们得到以下输出:
['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head', ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', '(', 'Michael', "'s", 'sister', ')', 'to', 'Carlo', 'Rizzi', '.', 'All', 'of', 'Michael', "'s", 'family', 'is', 'involved', 'with', 'the', 'Mafia', ',', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life', '.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money', '.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it', ',', 'but', ',', 'much', 'against', 'the', 'advice', 'of', 'the', 'Don', "'s", 'lawyer', 'Tom', 'Hagen', ',', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs', ',', 'and', 'turns', 'down', 'the', 'offer', '.', 'This', 'does', 'not', 'please', 'Sollozzo', ',', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men', '.', 'The', 'Don', 'barely', 'survives', ',', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart', '.'] Number of Words: 167
使用split计数单词
接下来,我们使用split函数来统计单词,在这里,单词“(head)”被计为一个单词,而不是像使用nltk那样计为3个单词。
FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() print lines_in_file.split() print "\n" print "Number of Words: ", len(lines_in_file.split())
当我们运行上述程序时,我们得到以下输出:
['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(head)', 'of', 'the', 'Corleone', 'Mafia', 'Family.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', "(Michael's", 'sister)', 'to', 'Carlo', 'Rizzi.', 'All', 'of', "Michael's", 'family', 'is', 'involved', 'with', 'the', 'Mafia,', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it,', 'but,', 'much', 'against', 'the', 'advice', 'of', 'the', "Don's", 'lawyer', 'Tom', 'Hagen,', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs,', 'and', 'turns', 'down', 'the', 'offer.', 'This', 'does', 'not', 'please', 'Sollozzo,', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men.', 'The', 'Don', 'barely', 'survives,', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart.'] Number of Words: 146
广告