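Python Program to Find the Number of Unique Words in a Text File

The task in this article is to find the number of unique words in a text file. This Python article uses two different examples to show how to find the unique words in a text file and count them. In the first example, the words are read from the text file and converted into a set of unique words, which is then counted. In Example 2, a list of words is first created and sorted. Duplicates are then removed from this sorted list, and finally the remaining unique words are counted to give the result.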

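Preprocessing Algorithm

Step 1 - Sign in with a Google account. Go to Google Colab. Open a new Colab notebook and write the Python code in it.

Step 2 - First, upload the txt file "file1.txt" to Google Colab.

Step 3 - Open the txt file for reading.

Step 4 - Convert the text of the file to lowercase.

Step 5 - Use the split function to separate the words in the txt file.

Step 6 - Print the list named "words_in_file", which contains the words from the text file.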
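The preprocessing steps above can be sketched as follows. This is a minimal sketch, not the article's own listing: the helper name read_words and the use of a temporary directory are assumptions made here so that the code also runs outside Google Colab, where "file1.txt" has not been uploaded.

```python
import os
import tempfile

# Sample content matching file1.txt as shown in this article
SAMPLE_TEXT = """This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.
"""


def read_words(path):
    """Hypothetical helper: read a text file, lowercase it, split on whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().lower()
    return text.split()


# Write the sample to a temporary file1.txt (assumption: stands in for the Colab upload)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "file1.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_TEXT)
    words_in_file = read_words(sample_path)

print(words_in_file[:5])   # ['this', 'is', 'a', 'new', 'file.']
print(len(words_in_file))  # 47
```

Because split() is called without arguments, it splits on any run of whitespace, including the line breaks, so no separate line handling is needed.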

Text File Used for These Examples

The content of file1.txt is shown below:

This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.

Uploading file1.txt to Colab

Figure: Uploading file1.txt in Google Colab

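Method 1: Find the Number of Unique Words in a Text File Using a Python Set

After the preprocessing steps, the following steps are used for Method 1.

Step 1 - Start with the list "words_in_file" produced by the preprocessing steps.

Step 2 - Convert this list to a set. The set contains only the unique words.

Step 3 - Use a print statement to display the set of unique words.

Step 4 - Find the length of the set.

Step 5 - Print the length of the set.

Step 6 - This gives the number of unique words in the given text file.

Example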

# Open the text file in read mode; the with block closes it automatically
with open("file1.txt", "r") as file:
    # Convert its content to lowercase
    thegiventxtfile = file.read().lower()

# Split the text into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe unique words given in this txt file are :\n")

# Convert the list to a Python set, which keeps only the unique words
uniqueWords = set(words_in_file)

print(uniqueWords)

# Find the number of words in the set
numberofuniquewords = len(uniqueWords)

print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The unique words given in this txt file are :

{'there', 'only.', 'testing', 'new', 'is', 'for', 'oh!', 'this', 'a', 'made', 'seven', 'are', 'purposes', 'in', 'file.', 'four', 'now.', 'no..', 'lines'}

The number of unique words given in this txt file are :

19
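Note that the output above keeps punctuation attached to the words, so "file." and "no.." are counted as they appear, and a file containing both "file" and "file." would count them as two different words. The article's program does not handle this. The following is only a sketch of one possible refinement, assuming that leading and trailing punctuation should be ignored; the function name unique_words_without_punctuation and the sample string are hypothetical.

```python
import string


def unique_words_without_punctuation(text):
    """Hypothetical helper: lowercase, split, and strip punctuation from both ends of each word."""
    words = text.lower().split()
    stripped = (word.strip(string.punctuation) for word in words)
    # Drop tokens that were made only of punctuation and are now empty
    return {word for word in stripped if word}


sample = "There are four lines in this file. Oh! No.. there are seven lines now."
unique = unique_words_without_punctuation(sample)

print(sorted(unique))
print(len(unique))  # 11
```

For the file1.txt used in this article the count stays the same either way, because no word appears both with and without punctuation.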

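Method 2: Find the Number of Unique Words in a Text File Using a Python Dictionary

Step 1 - Open the required file and build the list "words_in_file" as in the preprocessing steps.

Step 2 - Sort this list and print it. The alphabetically sorted list still shows the duplicate words.

Step 3 - To remove the duplicate words and keep only the unique ones, use dict.fromkeys(words_in_file).

Step 4 - Convert the result back to a list.

Step 5 - Print the list of unique words.

Step 6 - Calculate the length of the final list and display its value. This gives the number of unique words in the given text file.

Example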

# Open the text file in read mode; the with block closes it automatically
with open("file1.txt", "r") as file:
    # Convert its content to lowercase
    thegiventxtfile = file.read().lower()

# Split the text into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe sorted words list from this txt file is :\n")

# Sort the list of words in place
words_in_file.sort()

print(words_in_file)
print("\nThe sorted words list after removing duplicates from this txt file is :\n")

# Remove the duplicate words; dictionary keys are unique
myuniquewordlist = list(dict.fromkeys(words_in_file))

# Count the number of words left
numberofuniquewords = len(myuniquewordlist)

print(myuniquewordlist)
print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The sorted words list from this txt file is :

['a', 'are', 'are', 'are', 'are', 'are', 'file.', 'file.', 'file.', 'file.', 'file.', 'for', 'four', 'four', 'four', 'four', 'in', 'in', 'in', 'in', 'is', 'is', 'lines', 'lines', 'lines', 'lines', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'there', 'there', 'there', 'there', 'this', 'this', 'this', 'this', 'this', 'this']

The sorted words list after removing duplicates from this txt file is :

['a', 'are', 'file.', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'this']

The number of unique words given in this txt file are :

19

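Conclusion

This article showed two different ways to find the unique words in a given txt file. First, the txt file is uploaded to a Colab notebook and opened for reading. Its content is then split into separate words, which are stored in a list. Both examples in this Python article use this list of words.

Example 1 uses a Python set. The list may contain duplicate words; when it is converted to a set, only the unique words are kept. The len() function is then used to count them. In Example 2, the list of words from the txt file is first sorted, which places duplicate words next to each other and makes the repetitions easy to see. The sorted list is then passed to dict.fromkeys(words_in_file) to remove the duplicates, and the length of the resulting list gives the number of unique words.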
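The two approaches summarized above can be placed side by side in a short sketch. The function names count_unique_with_set and count_unique_with_dict and the short sample sentence are assumptions made for illustration; they are not part of the article's listings.

```python
def count_unique_with_set(words):
    """Method 1: a set keeps only unique items, so its length is the unique count."""
    return len(set(words))


def count_unique_with_dict(words):
    """Method 2: sort, then use dictionary keys to drop duplicates."""
    unique_in_order = list(dict.fromkeys(sorted(words)))
    return len(unique_in_order)


# Hypothetical sample: the first two lines of file1.txt, already lowercased
words = "this is a new file. this is made for testing purposes only.".split()

print(count_unique_with_set(words))   # 10
print(count_unique_with_dict(words))  # 10
```

Both functions always return the same count. The difference is in the order of the unique words: a set has no defined order, while dict.fromkeys keeps the order in which keys were inserted (guaranteed since Python 3.7), so the sorted input yields a sorted list of unique words.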

Saba Hilal

Updated on: 10 July 2023
