Gensim - Creating a Bag of Words (BoW) Corpus

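We have seen how to create a dictionary from a list of documents and from one or more text files. In this section, we will create a bag-of-words (BoW) corpus, one of the most important objects to be familiar with when working with Gensim. Essentially, it is a corpus that holds, for each document, the ID of every word together with its frequency in that document.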

Creating a BoW Corpus

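As discussed, in Gensim the corpus contains the word ID and frequency of every word in each document. We can create a BoW corpus from a simple list of documents or from text files. To do so, we pass each tokenized list of words to the Dictionary.doc2bow() method. Let us start by creating a BoW corpus from a simple list of documents.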

From a Simple List of Sentences

In the following example, we will create a BoW corpus from a simple list containing three sentences.

First, we need to import all the necessary packages as follows:

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

Next, provide the list of sentences. Our list contains three sentences:

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

Next, tokenize the sentences as follows:

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]

Create a corpora.Dictionary() object as follows:

dictionary = corpora.Dictionary()

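Now pass each tokenized sentence to the dictionary.doc2bow() method as follows. With allow_update=True, words that are not yet in the dictionary are added to it: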

BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

Finally, we can print the bag-of-words corpus:

print(BoW_corpus)

Output

[
   [(0, 1), (1, 1), (2, 1), (3, 1)], 
   [(2, 1), (3, 1), (4, 2)], [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)]
]

The output above shows that the word with ID 0 appears once in the first document (the pair (0, 1)), and so on.

This output is hard for humans to read. We can convert the IDs back to words using our dictionary, as follows:

id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)

Output

[
   [('are', 1), ('hello', 1), ('how', 1), ('you', 1)], 
   [('how', 1), ('you', 1), ('do', 2)], 
   [('are', 2), ('you', 3), ('doing', 2), ('hey', 1), ('what', 2), ('yes', 1)]
]

This output is much easier to read.
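
To make the (ID, count) pairs less mysterious, the following plain-Python sketch reproduces the same numbers without Gensim. It is only an illustration under stated assumptions: tokenize() is a rough stand-in for simple_preprocess, doc2bow_sketch() is a hypothetical helper name, and the alphabetical assignment of new IDs is chosen to match the output shown above rather than taken from Gensim's documentation.

```python
import re
from collections import Counter


def tokenize(text):
    # Rough stand-in for simple_preprocess (assumption): lowercase
    # alphabetic tokens that are between 2 and 15 characters long.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 15]


def doc2bow_sketch(tokens, token2id):
    """Return sorted (id, count) pairs; unseen tokens are added to token2id."""
    counts = Counter(tokens)
    # Unseen tokens get the next free IDs, handed out in alphabetical order.
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


doc_list = [
    "Hello, how are you?", "How do you do?",
    "Hey what are you doing? yes you What are you doing?",
]
token2id = {}  # plays the role of the Dictionary
bow = [doc2bow_sketch(tokenize(doc), token2id) for doc in doc_list]
print(bow)
# [[(0, 1), (1, 1), (2, 1), (3, 1)], [(2, 1), (3, 1), (4, 2)],
#  [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)]]
```

Because token2id is shared across all three calls, a word such as "you" keeps ID 3 in every document, which is what makes the vectors comparable with each other.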

Complete Implementation Example

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)

From a Text File

In the following example, we will create a BoW corpus from a text file. For this, we saved the document used in the previous example in a text file named doc.txt.

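The file is read line by line, and each line is processed with Gensim's simple_preprocess. This way, the whole file does not have to be loaded into memory at once.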

Implementation Example

First, import the required packages as follows:

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess

Next, the following lines of code read the documents from doc.txt and tokenize them:

doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()

Now we need to pass these tokenized words to the dictionary.doc2bow() method (as in the previous example):

BoW_corpus = [
   dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized
]
print(BoW_corpus)

Output

[
   [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], 
   [
      (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), 
      (22, 1), (23, 1), (24, 1)
   ], 
   [
      (23, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), 
      (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)
   ], 
   [(3, 1), (18, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], 
   [
      (18, 1), (27, 1), (31, 2), (32, 1), (38, 1), (41, 1), (43, 1), 
      (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)
   ]
]

Note that the IDs and the number of documents depend on the contents of doc.txt and on the state of the dictionary. With a freshly created dictionary the IDs start at 0; the listing above starts at 9, which suggests it was produced with a dictionary that already held the nine words from the previous example.

The doc.txt file has the following content:

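CNTK, formerly known as Computational Network Toolkit, is a free, easy-to-use, open-source, commercial-grade toolkit that enables us to train deep learning algorithms to learn like the human brain.

You can find its free tutorial on tutorialspoint.com, which also provides the best technical tutorials on technologies like AI, deep learning and machine learning for free.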

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)

Saving and Loading a Gensim Corpus

We can save the corpus with the following script:

corpora.MmCorpus.serialize('/Users/Desktop/BoW_corpus.mm', BoW_corpus)

# Provide the path and file name for the corpus. Here the file is named BoW_corpus.mm and the corpus is saved in Matrix Market format.

Similarly, we can load the saved corpus with the following script:

corpus_load = corpora.MmCorpus('/Users/Desktop/BoW_corpus.mm')
for line in corpus_load:
   print(line)
