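Gensim - Creating LDA Mallet Model
This chapter explains what a Latent Dirichlet Allocation (LDA) Mallet model is and how to create one in Gensim.
In the previous chapter, we implemented an LDA model and extracted topics from the documents of the 20Newsgroup dataset. That was Gensim's built-in version of the LDA algorithm. Gensim also offers a Mallet version of LDA, which often gives better topic quality. Here, we apply Mallet's LDA to the example implemented earlier.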
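What is the LDA Mallet Model?
Mallet is an open-source toolkit written by Andrew McCallum. It is a Java-based package for NLP, document classification, clustering, topic modeling, and many other machine learning applications to text. It includes the Mallet topic modeling toolkit, which contains efficient, sampling-based implementations of LDA as well as Hierarchical LDA.
Mallet 2.0 is the current release of MALLET, the Java topic modeling toolkit. Before using it with Gensim for LDA, we must download the mallet-2.0.8.zip package and unzip it on our system. Once it is unzipped, set the environment variable %MALLET_HOME% to point to the MALLET directory, either manually or with the code we provide when implementing LDA with Mallet.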
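Gensim Wrapper
Gensim provides a Python wrapper for Mallet's Latent Dirichlet Allocation (LDA). The wrapper class is gensim.models.wrappers.LdaMallet (available in Gensim 3.x; the wrappers package was removed in Gensim 4.0). This module, which uses collapsed Gibbs sampling from MALLET, allows estimating an LDA model from a training corpus and inferring the topic distribution of new, unseen documents.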
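To illustrate the inference step on an unseen document, here is a minimal sketch. The helper name infer_topics and its top_n parameter are hypothetical and not part of Gensim. It assumes that model is an already trained wrapper object that returns (topic_id, weight) pairs when indexed with a bag-of-words vector, and that id2word is the Gensim dictionary used for training.

```python
def infer_topics(model, id2word, tokens, top_n=3):
    """Return the top_n (topic_id, weight) pairs for one tokenized document.

    Hypothetical helper: `model` is assumed to be a trained LdaMallet-style
    object and `id2word` the dictionary it was trained with.
    """
    # Convert the tokens to a bag-of-words vector; by default doc2bow
    # ignores tokens that are not in the dictionary.
    bow = id2word.doc2bow(tokens)
    # Indexing the model with a single document is assumed to yield
    # a list of (topic_id, weight) pairs.
    doc_topics = model[bow]
    # Sort by weight, heaviest topic first, and keep the top_n entries.
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:top_n]
```

A call such as infer_topics(ldamallet, id2word, ['gun', 'law', 'police']) would then list the topics that dominate that short document, provided the tokens were preprocessed the same way as the training corpus.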
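Implementation Example
We will apply LDA Mallet to the corpus used for the previously built LDA model and check the difference in performance by computing the coherence score.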
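Providing the Path to the Mallet File
Before applying the Mallet LDA model to the corpus built in the previous example, we must update the environment variable and provide the path to the Mallet file. This can be done with the following code: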
import os
from gensim.models.wrappers import LdaMallet

# Update both paths to match the location of the Mallet directory on your system.
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
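Hard-coded paths fail late and with unhelpful errors when the directory is wrong. As a hedged sketch, the same setup can be wrapped in a small function that checks the directory first. The function name configure_mallet is hypothetical, and the check assumes that the unzipped distribution contains a launcher at bin/mallet or bin/mallet.bat.

```python
import os
from pathlib import Path


def configure_mallet(mallet_home):
    """Set MALLET_HOME and return the launcher path to pass to the wrapper.

    Hypothetical helper; assumes the MALLET directory contains
    bin/mallet or bin/mallet.bat.
    """
    home = Path(mallet_home)
    bin_dir = home / "bin"
    # Fail early if neither launcher script is present.
    if not any((bin_dir / name).is_file() for name in ("mallet", "mallet.bat")):
        raise FileNotFoundError(f"No MALLET launcher found in {bin_dir}")
    os.environ["MALLET_HOME"] = str(home)
    # Return the extension-less path, as in the tutorial's own example.
    return str(bin_dir / "mallet")


# Example usage (adjust the directory to your system):
# mallet_path = configure_mallet(r"C:/mallet-2.0.8")
```

Raising FileNotFoundError before training starts makes a misplaced directory much easier to diagnose than a failure inside the Java subprocess.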
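Once the path to the Mallet file is provided, we can use it on the corpus. Creating the LdaMallet object trains the model, and the ldamallet.show_topics() function then displays the topics, as shown below: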
from pprint import pprint

# corpus and id2word are the objects built in the previous chapter.
ldamallet = LdaMallet(
    mallet_path, corpus=corpus, num_topics=20, id2word=id2word
)
pprint(ldamallet.show_topics(formatted=False))
Output
[
  (4, [('gun', 0.024546225966016102), ('law', 0.02181426826996709), ('state', 0.017633545129043606), ('people', 0.017612848479831116), ('case', 0.011341763768445888), ('crime', 0.010596684396796159), ('weapon', 0.00985160502514643), ('person', 0.008671896020034356), ('firearm', 0.00838214293105946), ('police', 0.008257963035784506)]),
  (9, [('make', 0.02147966482730431), ('people', 0.021377478029838543), ('work', 0.018557122419783363), ('money', 0.016676885346413244), ('year', 0.015982015123646026), ('job', 0.012221540976905783), ('pay', 0.010239117106069897), ('time', 0.008910688739014919), ('school', 0.0079092581238504), ('support', 0.007357449417535254)]),
  (14, [('power', 0.018428398507941996), ('line', 0.013784244460364121), ('high', 0.01183271164249895), ('work', 0.011560979224821522), ('ground', 0.010770484918850819), ('current', 0.010745781971789235), ('wire', 0.008399002000938712), ('low', 0.008053160742076529), ('water', 0.006966231071366814), ('run', 0.006892122230182061)]),
  (0, [('people', 0.025218349201353372), ('kill', 0.01500904870564167), ('child', 0.013612400660948935), ('armenian', 0.010307655991816822), ('woman', 0.010287984892595798), ('start', 0.01003226060272248), ('day', 0.00967818081674404), ('happen', 0.009383114328428673), ('leave', 0.009383114328428673), ('fire', 0.009009363443229208)]),
  (1, [('file', 0.030686386604212003), ('program', 0.02227713642901929), ('window', 0.01945561169918489), ('set', 0.015914874783314277), ('line', 0.013831003577619592), ('display', 0.013794120901412606), ('application', 0.012576992586582082), ('entry', 0.009275993066056873), ('change', 0.00872275292295209), ('color', 0.008612104894331132)]),
  (12, [('line', 0.07153810971508515), ('buy', 0.02975597944523662), ('organization', 0.026877236406682988), ('host', 0.025451316957679788), ('price', 0.025182275552207485), ('sell', 0.02461728860071565), ('mail', 0.02192687454599263), ('good', 0.018967419085797303), ('sale', 0.017998870026097017), ('send', 0.013694207538540181)]),
  (11, [('thing', 0.04901329901329901), ('good', 0.0376018876018876), ('make', 0.03393393393393394), ('time', 0.03326898326898327), ('bad', 0.02664092664092664), ('happen', 0.017696267696267698), ('hear', 0.015615615615615615), ('problem', 0.015465465465465466), ('back', 0.015143715143715144), ('lot', 0.01495066495066495)]),
  (18, [('space', 0.020626317374284855), ('launch', 0.00965716006366413), ('system', 0.008560244332602057), ('project', 0.008173097603991913), ('time', 0.008108573149223556), ('cost', 0.007764442723792318), ('year', 0.0076784101174345075), ('earth', 0.007484836753129436), ('base', 0.0067535595990880545), ('large', 0.006689035144319697)]),
  (5, [('government', 0.01918437232469453), ('people', 0.01461203206475212), ('state', 0.011207097828624796), ('country', 0.010214802708381975), ('israeli', 0.010039691804809714), ('war', 0.009436532025838587), ('force', 0.00858043427504086), ('attack', 0.008424780138532182), ('land', 0.0076659662230523775), ('world', 0.0075103120865437)]),
  (2, [('car', 0.041091194044470564), ('bike', 0.015598981291017729), ('ride', 0.011019688510138114), ('drive', 0.010627877363110981), ('engine', 0.009403467528651191), ('speed', 0.008081104907434616), ('turn', 0.007738270153785875), ('back', 0.007738270153785875), ('front', 0.007468899990204721), ('big', 0.007370947203447938)])
]
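The raw list returned by show_topics(formatted=False) is hard to scan. The following sketch turns that structure, a list of (topic_id, [(word, weight), ...]) tuples, into one readable line per topic. The helper name format_topics and the num_words parameter are hypothetical, and the sample data is copied from the output above.

```python
def format_topics(topics, num_words=5):
    """Render (topic_id, [(word, weight), ...]) tuples as readable lines."""
    lines = []
    # Sort by topic id so the listing is stable between runs.
    for topic_id, word_weights in sorted(topics, key=lambda t: t[0]):
        words = ", ".join(
            f"{word} ({weight:.3f})" for word, weight in word_weights[:num_words]
        )
        lines.append(f"Topic {topic_id}: {words}")
    return lines


# Two topics taken from the tutorial output, truncated to three words each.
sample = [
    (4, [('gun', 0.024546225966016102), ('law', 0.02181426826996709),
         ('state', 0.017633545129043606)]),
    (2, [('car', 0.041091194044470564), ('bike', 0.015598981291017729),
         ('ride', 0.011019688510138114)]),
]
for line in format_topics(sample, num_words=3):
    print(line)
```

With a trained model, the same helper could be called as format_topics(ldamallet.show_topics(formatted=False)).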
Evaluating Performance
We can now evaluate its performance by computing the coherence score, as follows:
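from gensim.models import CoherenceModel

# Assumptions: data_lemmatized holds the tokenized texts from the previous chapter; 'c_v' is the measure.
coherence_model = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
print('Coherence Score:', coherence_model.get_coherence())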
Output
Coherence Score: 0.5842762900901401
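To give some intuition for what a coherence score measures, the sketch below computes a simplified UMass-style coherence for a single topic in plain Python: it rewards topics whose top words tend to occur in the same documents. This is an illustrative stand-in, not Gensim's implementation, so its values are not comparable to the score shown above. The function name umass_coherence and the toy documents are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """Simplified UMass-style coherence of one topic (higher is better).

    Averages log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, where D counts
    the documents containing all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the documents
            co_occurrences = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((co_occurrences + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


# Toy tokenized documents (hypothetical data).
docs = [["gun", "law", "crime"], ["gun", "law"], ["car", "bike"], ["car", "engine"]]
related = umass_coherence(["gun", "law"], docs)    # words that co-occur
unrelated = umass_coherence(["gun", "car"], docs)  # words that never co-occur
print(related > unrelated)
```

Words that share documents push the score up, while words that never appear together pull it down, which is the intuition behind comparing the built-in LDA model and the Mallet model by their coherence.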