- Python - 文本处理
- Python - 文本处理简介
- Python - 文本处理环境
- Python - 不可变字符串
- Python - 排序行
- Python - 重新格式化段落
- Python - 统计段落中的标记
- Python - 二进制 ASCII 转换
- Python - 字符串作为文件
- Python - 反向文件读取
- Python - 过滤重复的单词
- Python - 从文本中提取电子邮件
- Python - 从文本中提取 URL
- Python - 漂亮打印
- Python - 文本处理状态机
- Python - 大写字母和小写字母转换
- Python - 标记化
- Python - 删除停用词
- Python - 同义词和反义词
- Python - 文本翻译
- Python - 单词替换
- Python - 拼写检查
- Python - WordNet 接口
- Python - 语料库访问
- Python - 单词标注
- Python - 块和切块
- Python - 块分类
- Python - 文本分类
- Python - 双语词
- Python - 处理 PDF
- Python - 处理 Word 文档
- Python - 读取 RSS 源
- Python - 情感分析
- Python - 搜索和匹配
- Python - 文本混编
- Python - 文本换行
- Python - 频率分布
- Python - 文本摘要
- Python - 词干算法
- Python - 受约束的搜索
Python - 单词标注
标记是文本处理的一项基本特性,其中我们将单词标记为语法类别。我们借助标记化和 pos_tag 函数为每个单词创建标记。
import nltk text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest") tagged_text=nltk.pos_tag(text) print(tagged_text)
当我们运行上述程序时,会获得以下输出 −
[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]
标记描述
我们可以使用以下程序来描述每个标记的含义,该程序显示内置值。
import nltk nltk.help.upenn_tagset('NN') nltk.help.upenn_tagset('IN') nltk.help.upenn_tagset('DT')
当我们运行上述程序时,会获得以下输出 −
NN: noun, common, singular or mass common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... IN: preposition or conjunction, subordinating astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ... DT: determiner all an another any both del each either every half la many much nary neither no some such that the them these this those
标记语料库
我们还可以标记语料库数据,并查看该语料库中每个单词的标记结果。
import nltk from nltk.tokenize import sent_tokenize from nltk.corpus import gutenberg sample = gutenberg.raw("blake-poems.txt") tokenized = sent_tokenize(sample) for i in tokenized[:2]: words = nltk.word_tokenize(i) tagged = nltk.pos_tag(words) print(tagged)
当我们运行上述程序时,会获得以下输出 −
[([', 'JJ'), (Poems', 'NNP'), (by', 'IN'), (William', 'NNP'), (Blake', 'NNP'), (1789', 'CD'), (]', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (AND', 'NNP'), (OF', 'NNP'), (EXPERIENCE', 'NNP'), (and', 'CC'), (THE', 'NNP'), (BOOK', 'NNP'), (of', 'IN'), (THEL', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (INTRODUCTION', 'NNP'), (Piping', 'VBG'), (down', 'RP'), (the', 'DT'), (valleys', 'NN'), (wild', 'JJ'), (,', ','), (Piping', 'NNP'), (songs', 'NNS'), (of', 'IN'), (pleasant', 'JJ'), (glee', 'NN'), (,', ','), (On', 'IN'), (a', 'DT'), (cloud', 'NN'), (I', 'PRP'), (saw', 'VBD'), (a', 'DT'), (child', 'NN'), (,', ','), (And', 'CC'), (he', 'PRP'), (laughing', 'VBG'), (said', 'VBD'), (to', 'TO'), (me', 'PRP'), (:', ':'), (``', '``'), (Pipe', 'VB'), (a', 'DT'), (song', 'NN'), (about', 'IN'), (a', 'DT'), (Lamb', 'NN'), (!', '.'), (u"''", "''")]
广告