Removing stop words with NLTK in Python

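When computers process natural language, some extremely common words are excluded from the vocabulary entirely, because they appear to be of little value in selecting documents that match a user's need. These words are called stop words.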

For example, suppose you provide the following sentence as input:

John is a person who takes care of the people around him.

After removing the stop words, you get the following output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
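To make the idea concrete before bringing in any library, here is a minimal sketch in plain Python. The `STOP_WORDS` set is a tiny hand-picked list assumed purely for illustration (it is not NLTK's list), and `simple_tokenize` and `drop_stop_words` are hypothetical helper names; the regular expression is only a rough stand-in for a real tokenizer.

```python
import re

# Hypothetical, hand-picked stop word set for illustration only;
# a real stop word list is much longer.
STOP_WORDS = {"is", "a", "who", "of", "the", "him"}

def simple_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def drop_stop_words(text):
    # Keep every token that is not in the stop word set.
    return [tok for tok in simple_tokenize(text) if tok not in STOP_WORDS]

sentence = "John is a person who takes care of the people around him."
print(drop_stop_words(sentence))
# ['John', 'person', 'takes', 'care', 'people', 'around', '.']
```

Storing the stop words in a set keeps each membership check cheap, which matters once the list and the text grow.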

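NLTK ships with a collection of stop words that we can use to remove these words from a given sentence. The collection lives in the nltk.corpus module, and we can use it to filter stop words out of a sentence. For example:

Example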

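import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

my_sent = "John is a person who takes care of the people around him."
tokens = word_tokenize(my_sent)

# Use only the English list; stopwords.words() with no argument returns every language.
# A set makes each membership test fast.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokens if w not in stop_words]

print(filtered_sentence)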

Output

This gives the following output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
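One thing to watch: NLTK's English stop word list is all lower-case, so a plain membership test is case-sensitive and a capitalized "The" at the start of a sentence would slip through. The following is a minimal sketch of case-insensitive filtering. `remove_stop_words` is a hypothetical helper name, and `sample_stop_words` is a small stand-in assumed for illustration; in practice you would pass something like set(stopwords.words('english')).

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lower-cased form is not in stop_words."""
    # Normalize the stop words once, then compare lower-cased tokens.
    stop_words = {w.lower() for w in stop_words}
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Stand-in for set(stopwords.words('english')); illustration only.
sample_stop_words = {"the", "is", "a", "of"}

tokens = ["The", "cat", "is", "a", "friend", "of", "the", "dog", "."]
print(remove_stop_words(tokens, sample_stop_words))
# ['cat', 'friend', 'dog', '.']
```

The original casing of the kept tokens is preserved; only the comparison is lower-cased.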

Samual Sam

Updated on: 20-6-2020
