Python - Counting Words in a Paragraph



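When reading text from a source, we sometimes also need statistics about the kinds of words it uses. That means counting the words in a given text, as well as the number of lines that contain a particular kind of word. The examples below show programs that count the words in a paragraph using two different approaches. For this, we use a text file containing the summary of a Hollywood movie.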

Reading the File

# "Path" is a placeholder; replace it with the directory that holds the file.
FileName = r"Path\GodFather.txt"

with open(FileName, 'r') as file:
    lines_in_file = file.read()
    print(lines_in_file)

When we run the above program, we get the following output:

Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer. This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
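
The introduction also mentions counting the lines that contain a particular word. A minimal sketch of that idea, using only the standard library, might look like the following. The sample text is a shortened, hypothetical stand-in for the file contents, and the helper name count_lines_containing is an assumption made for illustration, not part of the original program.

```python
# Hypothetical sample text; in the tutorial this would come from file.read().
text = (
    "Vito Corleone is the aging don of the Corleone Mafia Family.\n"
    "His youngest son Michael has returned from WWII.\n"
    "Michael just wants to live a normal life.\n"
)

def count_lines_containing(text, word):
    """Count lines in which `word` appears as a whitespace-separated token."""
    target = word.lower()
    count = 0
    for line in text.splitlines():
        # Strip common punctuation so "Family." still matches "family".
        tokens = [t.strip(".,()\"'").lower() for t in line.split()]
        if target in tokens:
            count += 1
    return count

michael_lines = count_lines_containing(text, "Michael")
corleone_lines = count_lines_containing(text, "corleone")
print("Lines mentioning Michael:", michael_lines)
print("Lines mentioning Corleone:", corleone_lines)
```

The match is case-insensitive and works on whole tokens, so a search for "don" would not match "donate". Whether that is the right behaviour depends on the statistic you need.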

Counting Words Using nltk

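Next, we use the nltk module to count the words in the text. Note that "(head)" is counted as three tokens rather than one, because the tokenizer separates the parentheses from the word.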

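import nltk

# word_tokenize needs the Punkt tokenizer data; if it is missing, download it
# once with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab').
FileName = r"Path\GodFather.txt"

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    nltk_tokens = nltk.word_tokenize(lines_in_file)
    print(nltk_tokens)
    print("\n")
    print("Number of Words: ", len(nltk_tokens))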

When we run the above program, we get the following output:

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head', ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', '(', 'Michael', "'s", 'sister', ')', 'to', 'Carlo', 'Rizzi', '.', 'All', 'of', 'Michael', "'s", 'family', 'is', 'involved', 'with', 'the', 'Mafia', ',', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life', '.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money', '.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it', ',', 'but', ',', 'much', 'against', 'the', 'advice', 'of', 'the', 'Don', "'s", 'lawyer', 'Tom', 'Hagen', ',', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs', ',', 'and', 'turns', 'down', 'the', 'offer', '.', 'This', 'does', 'not', 'please', 'Sollozzo', ',', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men', '.', 'The', 'Don', 'barely', 'survives', ',', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart', '.']

Number of Words:  167
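
The count of 167 includes punctuation tokens such as '(', ')', ',' and '.'. If only word-like tokens should be counted, one possible approach is to filter the token list afterwards. The sketch below hard-codes the first 15 tokens of the output above so that it runs without nltk installed. The helper name words_only and the rule "keep tokens containing a letter or digit" are assumptions for illustration.

```python
# First 15 tokens from the nltk output above, hard-coded so the sketch
# runs without nltk installed.
nltk_tokens = ['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head',
               ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.']

def words_only(tokens):
    """Keep tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

words = words_only(nltk_tokens)
print("Tokens:", len(nltk_tokens))
print("Words only:", len(words))
```

With this rule, a clitic token such as "'s" is still kept because it contains a letter. Dropping it as well would need an extra condition.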

Counting Words Using split

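Next, we count the words using the split function. Here "(head)" is counted as a single word, not as three tokens as with nltk, because split breaks the text only on whitespace and leaves punctuation attached to the neighbouring word.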

FileName = r"Path\GodFather.txt"

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    # split() with no arguments breaks the text on any run of whitespace.
    words = lines_in_file.split()
    print(words)
    print("\n")
    print("Number of Words: ", len(words))

When we run the above program, we get the following output:

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(head)', 'of', 'the', 'Corleone', 'Mafia', 'Family.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', "(Michael's", 'sister)', 'to', 'Carlo', 'Rizzi.', 'All', 'of', "Michael's", 'family', 'is', 'involved', 'with', 'the', 'Mafia,', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it,', 'but,', 'much', 'against', 'the', 'advice', 'of', 'the', "Don's", 'lawyer', 'Tom', 'Hagen,', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs,', 'and', 'turns', 'down', 'the', 'offer.', 'This', 'does', 'not', 'please', 'Sollozzo,', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men.', 'The', 'Don', 'barely', 'survives,', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart.']

Number of Words:  146
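
The two counts differ by 21 because nltk emits each punctuation mark and each "'s" as its own token, while split keeps them attached to the neighbouring word. As a possible middle ground, the sketch below uses the standard re module to extract words without surrounding punctuation and collections.Counter to tally them. This is a swapped-in technique, not one of the two methods shown above, and the regular expression is an illustrative assumption. Only the opening sentence of the sample text is inlined to keep the example self-contained.

```python
import re
from collections import Counter

# Opening sentence of the sample text, inlined so the sketch is self-contained.
text = "Vito Corleone is the aging don (head) of the Corleone Mafia Family."

# Runs of letters, digits, or apostrophes; the pattern is an illustrative choice.
words = re.findall(r"[A-Za-z0-9']+", text)
print("Number of Words:", len(words))

# Case-insensitive frequency table of the extracted words.
frequencies = Counter(w.lower() for w in words)
print("Occurrences of 'corleone':", frequencies["corleone"])
```

Here "(head)" yields the single word "head", and "Family." yields "Family", so the count matches the split result for this sentence while the stored words are free of punctuation.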