- Gensim 教程
- Gensim - 主页
- Gensim - 简介
- Gensim - 入门
- Gensim - 文档和语料库
- Gensim - 向量和模型
- Gensim - 创建词典
- 创建词袋 (BoW) 语料库
- Gensim - 变换
- Gensim - 创建 TF-IDF 矩阵
- Gensim - 主题建模
- Gensim - 创建 LDA 主题模型
- Gensim - 使用 LDA 主题模型
- Gensim - 创建 LDA Mallet 模型
- Gensim - 文档和 LDA 模型
- Gensim - 创建 LSI 和 HDP 主题模型
- Gensim - 开发词嵌入
- Gensim - Doc2Vec 模型
- Gensim 的有用资源
- Gensim - 快速指南
- Gensim - 有用资源
- Gensim - 讨论
Gensim - Doc2Vec 模型
与 Word2Vec 模型相反,Doc2Vec 模型可用于创建一组词的矢量化表示,并将该组词统称为一个单一单元。它不仅提供句子中各个词的简单均值。
使用 Doc2Vec 创建文档向量
在使用 Doc2Vec 创建文档向量时,我们将使用可从 gensim.downloader 下载的 text8 数据集。
下载数据集
我们可以使用以下命令下载 text8 数据集 −
import gensim import gensim.downloader as api dataset = api.load("text8") data = [d for d in dataset]
下载 text8 数据集需要一些时间。
训练 Doc2Vec
为了训练模型,我们需要使用 models.doc2vec.TaggedDcument() 创建标记文档,如下所示 −
def tagged_document(list_of_list_of_words): for i, list_of_words in enumerate(list_of_list_of_words): yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i]) data_for_training = list(tagged_document(data))
我们可以打印训练好的数据集,如下所示 −
print(data_for_training [:1])
输出
[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been' , 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'social', 'movements', 'that', 'advocate', 'the', 'elimination', 'of', 'authoritarian', 'institutions', 'particularly', 'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists', 'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie', 'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society', 'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian', 'political', 'structures', 'and', 'coercive', 'economic', 'institutions', 'anarchists', 'advocate', 'social', 'relations', 'based', 'upon', 'voluntary', 'association', 'of', 'autonomous', 'individuals', 'mutual', 'aid', 'and', 'self', 'governance', 'while', 'anarchism', 'is', 'most', 'easily', 'defined', 'by', 'what', 'it', 'is', 'against', 'anarchists', 'also', 'offer', 'positive', 'visions', 'of', 'what', 'they', 'believe', 'to', 'be', 'a', 'truly', 'free', 'society', 'however', 'ideas', 'about', 'how', 'an', 'anarchist', 'society', 'might', 'work', 'vary', 'considerably', 'especially', 'with', 'respect', 'to', 'economics', 'there', 'is', 'also', 'disagreement', 'about', 'how', 'a', 'free', 'society', 'might', 'be', 'brought', 'about', 'origins', 'and', 'predecessors', 'kropotkin', 'and', 'others', 'argue', 'that', 'before', 'recorded', 'history', 'human', 'society', 'was', 'organized', 'on', 'anarchist', 'principles', 'most', 'anthropologists', 'follow', 'kropotkin', 'and', 'engels', 'in', 'believing', 'that', 'hunter', 'gatherer', 'bands', 'were', 'egalitarian', 'and', 'lacked', 'division', 'of', 'labour', 'accumulated', 'wealth', 'or', 'decreed', 'law', 'and', 'had', 'equal', 'access', 'to', 'resources', 'william', 'godwin', 'anarchists', 'including', 'the', 'the', 'anarchy', 'organisation', 'and', 'rothbard', 'find', 'anarchist', 'attitudes', 'in', 'taoism', 'from', 'ancient', 'china', 'kropotkin', 'found', 'similar', 'ideas', 'in', 'stoic', 'zeno', 'of', 'citium', 'according', 'to', 'kropotkin', 'zeno', 'repudiated', 'the', 'omnipotence', 'of', 'the', 'state', 'its', 'intervention', 'and', 'regimentation', 'and', 'proclaimed', 'the', 'sovereignty', 'of', 'the', 'moral', 'law', 'of', 'the', 'individual', 'the', 'anabaptists', 'of', 'one', 'six', 'th', 'century', 'europe', 'are', 'sometimes', 'considered', 'to', 'be', 'religious', 'forerunners', 'of', 'modern', 'anarchism', 'bertrand', 'russell', 'in', 'his', 'history', 'of', 'western', 'philosophy', 'writes', 'that', 'the', 'anabaptists', 'repudiated', 'all', 'law', 'since', 'they', 'held', 'that', 'the', 'good', 'man', 'will', 'be', 'guided', 'at', 'every', 'moment', 'by', 'the', 'holy', 'spirit', 'from', 'this', 'premise', 'they', 'arrive', 'at', 'communism', 'the', 'diggers', 'or', 'true', 'levellers', 'were', 'an', 'early', 'communistic', 'movement', (truncated…)
初始化模型
训练好之后,我们需要初始化模型。可以按照如下方式进行 −
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
然后,构建词汇表,如下所示 −
model.build_vocab(data_for_training)
现在,让我们训练 Doc2Vec 模型,如下所示 −
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
分析输出
最后,我们可以使用 model.infer_vector() 分析输出,如下所示 −
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))
完整的实现示例
import gensim import gensim.downloader as api dataset = api.load("text8") data = [d for d in dataset] def tagged_document(list_of_list_of_words): for i, list_of_words in enumerate(list_of_list_of_words): yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i]) data_for_training = list(tagged_document(data)) print(data_for_training[:1]) model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30) model.build_vocab(data_training) model.train(data_training, total_examples=model.corpus_count, epochs=model.epochs) print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))
输出
[ -0.2556166 0.4829361 0.17081228 0.10879577 0.12525807 0.10077011 -0.21383236 0.19294572 0.11864349 -0.03227958 -0.02207291 -0.7108424 0.07165232 0.24221905 -0.2924459 -0.03543589 0.21840079 -0.1274817 0.05455418 -0.28968817 -0.29146606 0.32885507 0.14689675 -0.06913587 -0.35173815 0.09340707 -0.3803535 -0.04030455 -0.10004586 0.22192696 0.2384828 -0.29779273 0.19236489 -0.25727913 0.09140676 0.01265439 0.08077634 -0.06902497 -0.07175519 -0.22583418 -0.21653089 0.00347822 -0.34096122 -0.06176808 0.22885063 -0.37295452 -0.08222228 -0.03148199 -0.06487323 0.11387568 ]
广告