Gensim - Doc2Vec 模型



与 Word2Vec 模型相反,Doc2Vec 模型可用于创建一组词的矢量化表示,并将该组词统称为一个单一单元。它不仅提供句子中各个词的简单均值。

使用 Doc2Vec 创建文档向量

在使用 Doc2Vec 创建文档向量时,我们将使用可从 gensim.downloader 下载的 text8 数据集。

下载数据集

我们可以使用以下命令下载 text8 数据集 −

import gensim
import gensim.downloader as api
dataset = api.load("text8")
data = [d for d in dataset]

下载 text8 数据集需要一些时间。

训练 Doc2Vec

为了训练模型,我们需要使用 models.doc2vec.TaggedDcument() 创建标记文档,如下所示 −

def tagged_document(list_of_list_of_words):
   for i, list_of_words in enumerate(list_of_list_of_words):
      yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))

我们可以打印训练好的数据集,如下所示 −

print(data_for_training [:1])

输出

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of',
'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals',
'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 
'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution',
'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative',
'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 
'means', 'to', 'destroy',
'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been'
, 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined',
'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the',
'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 
'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 
'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished',
'although', 'there', 'are', 'differing', 'interpretations', 'of', 
'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 
'related', 'social', 'movements', 'that', 'advocate', 'the', 
'elimination', 'of', 'authoritarian', 'institutions', 'particularly',
'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists', 
'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie',
'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society', 
'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian',
'political', 'structures', 'and', 'coercive', 'economic', 'institutions', 
'anarchists', 'advocate', 'social', 'relations', 'based', 'upon', 'voluntary',
'association', 'of', 'autonomous', 'individuals', 'mutual', 'aid', 'and', 
'self', 'governance', 'while', 'anarchism', 'is', 'most', 'easily', 'defined',
'by', 'what', 'it', 'is', 'against', 'anarchists', 'also', 'offer', 
'positive', 'visions', 'of', 'what', 'they', 'believe', 'to', 'be', 'a',
'truly', 'free', 'society', 'however', 'ideas', 'about', 'how', 'an', 'anarchist',
'society', 'might', 'work', 'vary', 'considerably', 'especially', 'with',
'respect', 'to', 'economics', 'there', 'is', 'also', 'disagreement', 'about', 
'how', 'a', 'free', 'society', 'might', 'be', 'brought', 'about', 'origins', 
'and', 'predecessors', 'kropotkin', 'and', 'others', 'argue', 'that', 'before',
'recorded', 'history', 'human', 'society', 'was', 'organized', 'on', 'anarchist', 
'principles', 'most', 'anthropologists', 'follow', 'kropotkin', 'and', 'engels', 
'in', 'believing', 'that', 'hunter', 'gatherer', 'bands', 'were', 'egalitarian',
'and', 'lacked', 'division', 'of', 'labour', 'accumulated', 'wealth', 'or', 'decreed',
'law', 'and', 'had', 'equal', 'access', 'to', 'resources', 'william', 'godwin', 
'anarchists', 'including', 'the', 'the', 'anarchy', 'organisation', 'and', 'rothbard',
'find', 'anarchist', 'attitudes', 'in', 'taoism', 'from', 'ancient', 'china', 
'kropotkin', 'found', 'similar', 'ideas', 'in', 'stoic', 'zeno', 'of', 'citium', 
'according', 'to', 'kropotkin', 'zeno', 'repudiated', 'the', 'omnipotence', 'of',
'the', 'state', 'its', 'intervention', 'and', 'regimentation', 'and', 'proclaimed',
'the', 'sovereignty', 'of', 'the', 'moral', 'law', 'of', 'the', 'individual', 'the',
'anabaptists', 'of', 'one', 'six', 'th', 'century', 'europe', 'are', 'sometimes',
'considered', 'to', 'be', 'religious', 'forerunners', 'of', 'modern', 'anarchism',
'bertrand', 'russell', 'in', 'his', 'history', 'of', 'western', 'philosophy', 
'writes', 'that', 'the', 'anabaptists', 'repudiated', 'all', 'law', 'since', 
'they', 'held', 'that', 'the', 'good', 'man', 'will', 'be', 'guided', 'at', 
'every', 'moment', 'by', 'the', 'holy', 'spirit', 'from', 'this', 'premise',
'they', 'arrive', 'at', 'communism', 'the', 'diggers', 'or', 'true', 'levellers', 
'were', 'an', 'early', 'communistic', 'movement',
(truncated…)

初始化模型

训练好之后,我们需要初始化模型。可以按照如下方式进行 −

model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

然后,构建词汇表,如下所示 −

model.build_vocab(data_for_training)

现在,让我们训练 Doc2Vec 模型,如下所示 −

model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

分析输出

最后,我们可以使用 model.infer_vector() 分析输出,如下所示 −

print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))

完整的实现示例

import gensim
import gensim.downloader as api
dataset = api.load("text8")
data = [d for d in dataset]
def tagged_document(list_of_list_of_words):
   for i, list_of_words in enumerate(list_of_list_of_words):
      yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))
print(data_for_training[:1])
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_training)
model.train(data_training, total_examples=model.corpus_count, epochs=model.epochs)
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))

输出

[
   -0.2556166 0.4829361 0.17081228 0.10879577 0.12525807 0.10077011
   -0.21383236 0.19294572 0.11864349 -0.03227958 -0.02207291 -0.7108424
   0.07165232 0.24221905 -0.2924459 -0.03543589 0.21840079 -0.1274817
   0.05455418 -0.28968817 -0.29146606 0.32885507 0.14689675 -0.06913587
   -0.35173815 0.09340707 -0.3803535 -0.04030455 -0.10004586 0.22192696
   0.2384828 -0.29779273 0.19236489 -0.25727913 0.09140676 0.01265439
   0.08077634 -0.06902497 -0.07175519 -0.22583418 -0.21653089 0.00347822
   -0.34096122 -0.06176808 0.22885063 -0.37295452 -0.08222228 -0.03148199
   -0.06487323 0.11387568
]
广告