python - How to get a complete topic distribution for a document using gensim LDA?
When I train an LDA model like this:

    import multiprocessing
    from gensim import corpora
    from gensim.models import LdaMulticore

    dictionary = corpora.Dictionary(data)
    corpus = [dictionary.doc2bow(doc) for doc in data]
    num_cores = multiprocessing.cpu_count()
    num_topics = 50
    lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                       workers=num_cores, alpha=1e-5, eta=5e-1)
I want the full topic distribution over all num_topics topics for each and every document. That is, in this particular case, I want each document to have 50 topics contributing to its distribution, and I want to be able to access all 50 topics' contributions. This is what the output of LDA should be if adhering strictly to the mathematics of LDA. However, gensim only outputs topics that exceed a certain threshold, as shown here. For example, if I try
    lda[corpus[89]]
    >>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]
which shows that only 3 topics contribute to document 89. I have tried the solution in the link above, but it does not work for me. I still get the same output:
    theta, _ = lda.inference(corpus)
    theta /= theta.sum(axis=1)[:, None]

which produces the same output, i.e. only 2-3 topics per document.
My question is: how do I change the threshold so that I can access the full topic distribution for each document? In other words, how can I access the full topic distribution, no matter how insignificant a topic's contribution to a given document is? The reason I want the full distribution is so that I can perform a KL divergence similarity search between documents' distributions.
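For reference, once the full distributions are available, the KL comparison mentioned above could be sketched like this (a minimal, self-contained example; `doc_a` and `doc_b` are hypothetical toy stand-ins for two documents' dense 50-topic vectors, not output from the model in this question):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i).
    # Assumes q has no zero entries where p is nonzero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    # Symmetrised variant, often used for document similarity
    # because plain KL divergence is not symmetric.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy 4-topic distributions standing in for full topic vectors.
doc_a = [0.7, 0.1, 0.1, 0.1]
doc_b = [0.1, 0.1, 0.1, 0.7]
print(symmetric_kl(doc_a, doc_b))
```

Note that this only works if every topic has a (possibly tiny) nonzero probability in both documents, which is exactly why the truncated gensim output is a problem.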
Thanks in advance.
Since it doesn't seem like anyone has replied yet, I'll try and answer as best I can given the gensim documentation.

It seems you need to set the parameter minimum_probability to 0.0 when training the model to get the desired results:
    lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                       workers=num_cores, alpha=1e-5, eta=5e-1,
                       minimum_probability=0.0)
    lda[corpus[233]]
    >>> [(0, 5.8821799358842424e-07), (1, 5.8821799358842424e-07),
         (2, 5.8821799358842424e-07), (3, 5.8821799358842424e-07),
         (4, 5.8821799358842424e-07), (5, 5.8821799358842424e-07),
         (6, 5.8821799358842424e-07), (7, 5.8821799358842424e-07),
         (8, 5.8821799358842424e-07), (9, 5.8821799358842424e-07),
         (10, 5.8821799358842424e-07), (11, 5.8821799358842424e-07),
         (12, 5.8821799358842424e-07), (13, 5.8821799358842424e-07),
         (14, 5.8821799358842424e-07), (15, 5.8821799358842424e-07),
         (16, 5.8821799358842424e-07), (17, 5.8821799358842424e-07),
         (18, 5.8821799358842424e-07), (19, 5.8821799358842424e-07),
         (20, 5.8821799358842424e-07), (21, 5.8821799358842424e-07),
         (22, 5.8821799358842424e-07), (23, 5.8821799358842424e-07),
         (24, 5.8821799358842424e-07), (25, 5.8821799358842424e-07),
         (26, 5.8821799358842424e-07), (27, 0.99997117731831464),
         (28, 5.8821799358842424e-07), (29, 5.8821799358842424e-07),
         (30, 5.8821799358842424e-07), (31, 5.8821799358842424e-07),
         (32, 5.8821799358842424e-07), (33, 5.8821799358842424e-07),
         (34, 5.8821799358842424e-07), (35, 5.8821799358842424e-07),
         (36, 5.8821799358842424e-07), (37, 5.8821799358842424e-07),
         (38, 5.8821799358842424e-07), (39, 5.8821799358842424e-07),
         (40, 5.8821799358842424e-07), (41, 5.8821799358842424e-07),
         (42, 5.8821799358842424e-07), (43, 5.8821799358842424e-07),
         (44, 5.8821799358842424e-07), (45, 5.8821799358842424e-07),
         (46, 5.8821799358842424e-07), (47, 5.8821799358842424e-07),
         (48, 5.8821799358842424e-07), (49, 5.8821799358842424e-07)]
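Once you have the full list of (topic_id, probability) pairs, you may still want a dense vector of length num_topics for the KL comparison. A small sketch of that conversion (the `pairs` values below are made-up numbers shaped like the truncated output earlier in the question, not real model output):

```python
def to_dense(topic_pairs, num_topics):
    # Convert gensim's sparse [(topic_id, prob), ...] output into a
    # dense list of length num_topics, filling missing topics with 0.0.
    vec = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        vec[topic_id] = prob
    return vec

# Hypothetical sparse output for one document.
pairs = [(2, 0.39), (9, 0.15), (37, 0.46)]
dense = to_dense(pairs, 50)
print(len(dense), dense[2], dense[0])
```

With minimum_probability=0.0 every topic should already appear in the output, but this conversion also guards against any pairs the model still omits.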