Gensim - Transformations

This chapter walks through the transformations available in Gensim, starting with document transformation.

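Document Transformation

Transforming a document means representing it in a form that can be manipulated mathematically. Besides inferring the latent structure of the corpus, transforming documents serves the following goals:

It discovers relationships between words.
It brings out hidden structure in the corpus.
It describes documents in a new, more semantic way.
It makes the document representation more compact.
It improves efficiency, because the new representation consumes fewer resources.
It improves efficacy, because the new representation ignores marginal data trends.
It reduces noise in the new document representation.

Let us look at the steps for converting documents from one vector space representation to another.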

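Implementation Steps

To transform documents, follow these steps:

Step 1: Creating the Corpus

The first and most basic step is to create a corpus from the documents. We already created a corpus in earlier examples. Let us create another one with some refinements (removing common words and words that appear only once):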

import gensim
import pprint
from collections import defaultdict
from gensim import corpora

Now provide the documents used to build the corpus:

t_corpus = ["CNTK formerly known as Computational Network Toolkit", "is a free easy-to-use open-source commercial-grade toolkit", "that enable us to train deep learning algorithms to learn like the human brain.", "You can find its free tutorial on tutorialspoint.com", "Tutorialspoint.com also provide best technical tutorials on technologies like AI deep learning machine learning for free"]

Next, we tokenize the documents and remove common words at the same time:

stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]

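The following script removes words that appear only once:

# Count how often each token occurs across the whole corpus
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1

# Filter only after counting has finished, so the counts are complete
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)

Output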

[
   ['toolkit'],
   ['free', 'toolkit'],
   ['deep', 'learning', 'like'],
   ['free', 'on', 'tutorialspoint.com'],
   ['tutorialspoint.com', 'on', 'like', 'deep', 'learning', 'learning', 'free']
]

Now pass it to a `corpora.Dictionary` object to collect the unique tokens in the corpus:

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Output

Dictionary(7 unique tokens: ['toolkit', 'free', 'deep', 'learning', 'like']...)

Next, the following line creates the bag-of-words model for our corpus:

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

Output

[
   [(0, 1)],
   [(0, 1), (1, 1)],
   [(2, 1), (3, 1), (4, 1)],
   [(1, 1), (5, 1), (6, 1)],
   [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)]
]
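
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the kind of counting that `doc2bow` performs. The names `token2id`, `to_bow` and `bow_sketch` are illustrative, and the id assignment rule is an assumption chosen to reproduce the ids shown above; it is not meant as a description of Gensim's internals.

```python
from collections import Counter

# Tokenized documents, copied from the output of the previous step.
processed_corpus = [
    ['toolkit'],
    ['free', 'toolkit'],
    ['deep', 'learning', 'like'],
    ['free', 'on', 'tutorialspoint.com'],
    ['tutorialspoint.com', 'on', 'like', 'deep', 'learning', 'learning', 'free'],
]

# Assumption: new tokens receive the next free id, document by document,
# in alphabetical order within each document.
token2id = {}
for text in processed_corpus:
    for token in sorted(set(text)):
        if token not in token2id:
            token2id[token] = len(token2id)


def to_bow(text):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())


bow_sketch = [to_bow(text) for text in processed_corpus]
for doc in bow_sketch:
    print(doc)
```

Each pair holds a token id and the number of times that token occurs in the document, which is why 'learning' (id 3) carries a count of 2 in the last document.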

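Step 2: Creating a Transformation

Transformations are standard Python objects, and they are initialized (trained) from a training corpus. Here we use the `tf-idf` model to create a transformation from our training corpus `BoW_corpus`.

First, import the models package from gensim.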

from gensim import models

Now initialize the model as follows:

tfidf = models.TfidfModel(BoW_corpus)

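Step 3: Transforming Vectors

In this last step, vectors are converted from the old representation to the new one. Since the tfidf model was initialized in the previous step, tfidf is now treated as a read-only object. Here we use the tfidf object to convert a vector from the bag-of-words representation (old) to Tfidf real-valued weights (new).

doc_BoW = [(1, 1), (3, 1)]
print(tfidf[doc_BoW])

Output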

[(1, 0.4869354917707381), (3, 0.8734379353188121)]
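
The two weights above can be checked by hand. The sketch below assumes Gensim's default weighting (raw term frequency multiplied by log2(N / df), followed by L2 normalization of the vector); `doc_freq` and `tfidf_sketch` are illustrative names and not part of the Gensim API.

```python
import math

# Bag-of-words corpus, copied from the output of Step 1.
BoW_corpus = [
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(2, 1), (3, 1), (4, 1)],
    [(1, 1), (5, 1), (6, 1)],
    [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)],
]

num_docs = len(BoW_corpus)

# Document frequency: in how many documents each token id occurs.
doc_freq = {}
for doc in BoW_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_sketch(bow):
    """Weight each term by tf * log2(N / df), then scale to unit length."""
    weights = [
        (token_id, tf * math.log2(num_docs / doc_freq[token_id]))
        for token_id, tf in bow
    ]
    # The norm is non-zero here because no term occurs in every document.
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(token_id, w / norm) for token_id, w in weights]


manual = tfidf_sketch([(1, 1), (3, 1)])
print(manual)
```

Token 1 ('free') occurs in three of the five documents and token 3 ('learning') in two, so the rarer token ends up with the larger weight.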

We applied the transformation to a single vector with two terms, but it can also be applied to the whole corpus, as follows:

corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

Output

[(0, 1.0)]
[(0, 0.8734379353188121), (1, 0.4869354917707381)]
[(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)]
[(1, 0.3667400603126873), (5, 0.657838022678017), (6, 0.657838022678017)]
[
   (1, 0.19338287240886842), (2, 0.34687949360312714), (3, 0.6937589872062543), 
   (4, 0.34687949360312714), (5, 0.34687949360312714), (6, 0.34687949360312714)
]

Complete Implementation Example

import gensim
import pprint
from collections import defaultdict
from gensim import corpora, models

t_corpus = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain.",
   "You can find its free tutorial on tutorialspoint.com",
   # Adjacent string literals are joined into a single document
   "Tutorialspoint.com also provide best technical tutorials on "
   "technologies like AI deep learning machine learning for free"
]

# Tokenize and remove common words
stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]

# Count token frequencies, then drop tokens that appear only once
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

# Train the tf-idf transformation, then apply it to one vector and to the corpus
tfidf = models.TfidfModel(BoW_corpus)
doc_BoW = [(1, 1), (3, 1)]
print(tfidf[doc_BoW])
corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

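Various Transformations in Gensim

With Gensim we can implement several popular transformations, that is, vector space model algorithms. Some of them are as follows:

Tf-Idf (Term Frequency-Inverse Document Frequency)

During initialization, the tf-idf model expects a training corpus with integer values (such as a bag-of-words model). At transformation time, it takes a vector representation and returns another vector representation.

The output vector has the same dimensionality, but the values of features that were rare in the training corpus are increased. In essence, it converts integer-valued vectors into real-valued vectors. The syntax of the Tf-idf transformation is:

model = models.TfidfModel(corpus, normalize=True)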

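LSI (Latent Semantic Indexing)

The LSI model can transform documents from either an integer-valued vector model (such as bag-of-words) or a Tf-Idf weighted space into a latent space. The output vectors have lower dimensionality. The syntax of the LSI transformation is:

model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)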

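LDA (Latent Dirichlet Allocation)

LDA is another algorithm that transforms documents from the bag-of-words space into a topic space. The output vectors have lower dimensionality. The syntax of the LDA transformation is:

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)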

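Random Projections (RP)

RP is a very efficient approach that aims to reduce the dimensionality of the vector space. It approximates the Tf-Idf distances between documents by introducing a little randomness. The syntax of the RP transformation is:

model = models.RpModel(tfidf_corpus, num_topics=500)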

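Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric Bayesian method and a new addition to Gensim, so it should be used with care. The syntax of the HDP transformation is:

model = models.HdpModel(corpus, id2word=dictionary)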
