Part of Speech Tagging with Stop Words Using NLTK in Python


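The main idea behind natural language processing is that a machine can perform some form of analysis or processing with little or no human intervention, such as understanding part of what a text means or is trying to say.

When processing text, a computer needs to filter out useless or less important data (words). In NLTK, these low-value words are called stop words.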

Installing the required library

First, you need the nltk library. Run the following command in your terminal:

$ pip install nltk

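We remove these stop words so that they neither take up space in our database nor consume valuable processing time.

You can create your own list of words that you consider stop words. By default, NLTK ships with a list of words it treats as stop words, which you can access through the NLTK corpus as follows: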

>>> import nltk
>>> nltk.download('stopwords')  # one-time download of the stop word corpus
>>> from nltk.corpus import stopwords

Here is the list of English stop words in NLTK:

>>> set(stopwords.words('english'))
{'not', 'other', 'shan', "hadn't", 'she', 'did', 'through', 'and', 'does', "that'll", "weren't", 'your', "should've", "hasn't", 'myself', 'should', 'because', 'wasn', 'what', 'to', 'this', 'was', 'more', 'y', 'again', "needn't", 'into', 'above', 'themselves', 'd', "won't", 'during', 'haven', 'both', "shan't", 'their', 'on', 'hadn', 'up', 'once', 'its', 'against', 'before', 't', 'while', 'needn', 'doing', "don't", 'yourselves', 'until', 'is', 'all', 's', 'will', "you've", 'being', 'under', 'they', 'ours', 'wouldn', 'of', 'didn', 'below', 'just', 'ma', 'yours', "you'll", 'mightn', 'where', 'are', 'that', 'those', 'most', 'them', 'if', 'you', "shouldn't", 'off', 'for', 'her', 'such', 'now', 'than', 're', 'no', 'm', 'or', "aren't", 'further', 'here', "wasn't", 'after', "haven't", 'my', 'himself', 'at', 'had', 'yourself', 'by', 'weren', 'only', 'have', 'we', 'do', 'same', "isn't", 'herself', 'll', 'down', 'then', 'why', 'own', 'him', 'so', 'having', 'nor', 'isn', 'few', 'how', 'each', 'there', 'with', 'couldn', 'about', 'very', 'am', 'me', "didn't", "doesn't", 'which', "she's", 'doesn', 'were', 'he', 'in', "mightn't", 'when', 'our', 'who', 'his', "couldn't", 'the', "you'd", 'be', 'hers', 'hasn', 'between', 'it', 'mustn', 'but', 'out', 'can', "wouldn't", 'ourselves', 'whom', 'been', 'these', 'aren', 'over', 'itself', 'a', 'i', 'too', 'theirs', 'some', "you're", 'as', 'won', "it's", 'from', 'o', 'don', 'any', 've', 'ain', 'has', 'an', "mustn't", 'shouldn'}
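
Because the default list is just a collection of strings, it can be adapted with ordinary set operations. The sketch below is illustrative only: `default_stop_words` is a small hand-copied subset of the list above (so the snippet runs without NLTK data), and `custom_words` and `keep_words` are hypothetical choices, not part of NLTK.

```python
# Small stand-in for set(stopwords.words('english')), copied from the list above.
default_stop_words = {"the", "is", "a", "not", "no", "nor", "of", "and", "to"}

# Hypothetical domain-specific words we also want to drop.
custom_words = {"python", "language"}

# Hypothetical words we want to keep, e.g. negations that carry meaning.
keep_words = {"not", "no", "nor"}

# Union adds the custom words; difference removes the ones to keep.
stop_words = (default_stop_words | custom_words) - keep_words

tokens = "python is not a compiled language".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['not', 'compiled']
```

With the real corpus, the same operations should apply to `set(stopwords.words('english'))` in place of the stand-in set.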

The following complete program demonstrates how to use the stop word list to remove stop words from a text.

Example code

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

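# Adjacent string literals are joined with no space between sentences,
# which is why tokens such as 'Rossum.It' appear in the output.
example_sent = "Python is a powerful high-level, object-oriented programming language created by Guido van Rossum."\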
"It has simple easy-to-use syntax, making it the perfect language for someone trying to learn computer programming for the first time."\
"This is a comprehensive guide on how to get started in Python, why you should learn it and how you can learn it. However, if you knowledge "\
"of other programming languages and want to quickly get started with Python."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop, shown for comparison
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Output

Text output: unfiltered (with stop words)

['Python', 'is', 'a', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'by', 'Guido', 'van', 'Rossum.It', 'has', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'it', 'the', 'perfect', 'language', 'for', 'someone', 'trying', 'to', 'learn', 'computer', 'programming', 'for', 'the', 'first', 'time.This', 'is', 'a', 'comprehensive', 'guide', 'on', 'how', 'to', 'get', 'started', 'in', 'Python', ',', 'why', 'you', 'should', 'learn', 'it', 'and', 'how', 'you', 'can', 'learn', 'it', '.', 'However', ',', 'if', 'you', 'knowledge', 'of', 'other', 'programming', 'languages', 'and', 'want', 'to', 'quickly', 'get', 'started', 'with', 'Python', '.']

Text output: filtered (stop words removed)

['Python', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'Guido', 'van', 'Rossum.It', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'perfect', 'language', 'someone', 'trying', 'learn', 'computer', 'programming', 'first', 'time.This', 'comprehensive', 'guide', 'get', 'started', 'Python', ',', 'learn', 'learn', '.', 'However', ',', 'knowledge', 'programming', 'languages', 'want', 'quickly', 'get', 'started', 'Python', '.']
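
The filtered output still contains punctuation tokens such as ',' and '.', and the lookup is case-sensitive: the NLTK list is all lowercase, so a token like 'The' would not be matched. A minimal sketch of one way to handle both follows. It assumes a small stand-in `stop_words` set and hand-written `tokens` shaped like `word_tokenize` output, so it runs without NLTK data.

```python
import string

# Stand-in subset of the NLTK English stop word list (all lowercase).
stop_words = {"the", "is", "a", "it", "this", "by"}

# Single punctuation characters to discard.
punctuation = set(string.punctuation)

# Hand-written tokens, shaped like word_tokenize output.
tokens = ["This", "is", "a", "guide", ",", "created", "by", "Guido", "."]

filtered = [
    t for t in tokens
    if t.lower() not in stop_words   # case-insensitive stop word lookup
    and t not in punctuation         # skip single punctuation tokens
]
print(filtered)  # ['guide', 'created', 'Guido']
```

Lowercasing only for the lookup keeps the original casing of the surviving tokens, which matters for proper nouns such as 'Guido'.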

Updated on: 30 July 2019
