Removing stop words with NLTK in Python

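When computers process natural language, some extremely common words appear to be of little value in helping select documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words.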

For example, suppose you give the following input sentence:

John is a person who takes care of the people around him.

After removing the stop words, you get the following output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']

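NLTK ships with a collection of stop words that can be used to remove them from any given sentence. It lives in the nltk.corpus module, and we can use it to filter the stop words out of a sentence. For example: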

Example

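import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time data download (newer NLTK releases may also need 'punkt_tab').
nltk.download('stopwords')
nltk.download('punkt')

my_sent = "John is a person who takes care of people around him."
tokens = word_tokenize(my_sent)

# Use only the English list: stopwords.words() with no argument returns
# the lists for every language. A set keeps each membership test fast.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokens if w not in stop_words]

print(filtered_sentence)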

Output

This produces the following output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
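The list comprehension above compares each token exactly as written. NLTK's English stop word list is lowercase, so a capitalized stop word such as "The" at the start of a sentence would slip through, and punctuation tokens such as '.' are kept as well. The following is a minimal sketch of one way to handle both cases. It uses only the standard library so that it runs without the NLTK data; the names remove_stop_words, SAMPLE_STOP_WORDS, and drop_punctuation are hypothetical and chosen for illustration, and the small stop word set is a hand-picked subset, not NLTK's actual list.

```python
import string

# Hypothetical hand-picked subset of English stop words, for illustration only.
# With NLTK installed, set(stopwords.words('english')) could be passed instead.
SAMPLE_STOP_WORDS = {"is", "a", "who", "of", "the", "him"}


def remove_stop_words(tokens, stop_words, drop_punctuation=False):
    """Return the tokens that are not stop words, comparing case-insensitively."""
    stop_words = set(stop_words)  # a set gives fast membership tests
    kept = []
    for token in tokens:
        # Lowercase only for the comparison; the original token is what we keep.
        if token.lower() in stop_words:
            continue
        # Optionally skip tokens made up entirely of ASCII punctuation.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["The", "person", "is", "John", "."]
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS))
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS, drop_punctuation=True))
```

With the sample tokens, the first call prints ['person', 'John', '.'] and the second prints ['person', 'John']. Whether to lowercase before comparing or to drop punctuation depends on the downstream task, so both are left as choices for the caller.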

Samual Sam

Updated on: 20-Jun-2020
