
Topic Model Based On Dirichlet Process

Posted on: 2020-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2428330605974760
Subject: Software engineering
Abstract/Summary:
Topic models are a powerful text-processing technique widely used in many fields. They have been developed continuously and now offer a relatively complete theory and effective solutions for a range of problems. Although topic models are relatively mature tools, they still have defects. In particular, online LDA algorithms use a fixed vocabulary, so words in a stream that are not in that vocabulary cannot be processed effectively. In this paper, the incremental vocabulary Latent Dirichlet Allocation (ivLDA) model is proposed to solve this problem. Based on the theory of ivLDA, two algorithms are proposed: ivLDA-Perp, which performs well on perplexity, and ivLDA-PMI, which performs well on the PMI index. The two algorithms target different application fields and address different problems. The main contributions of this paper are summarized as follows:

1) To overcome the fixed-vocabulary drawback of online LDA algorithms, the ivLDA model is proposed. ivLDA uses a Dirichlet process as its topic-word distribution, which gives it an incremental vocabulary: words that are not yet in the vocabulary can be added to it. ivLDA does not need the vocabulary to be fixed before the algorithm runs. When it encounters a word in the stream that is not in the vocabulary, it adds the word and then processes the stream with the updated vocabulary. Compared with online LDA algorithms, ivLDA yields a more accurate model.

2) Based on the theory of ivLDA, a construction scheme for the Dirichlet process is proposed that improves on those of dvOBP and infvoc-LDA; the algorithm that uses this scheme is named ivLDA-Perp. The Dirichlet process is only a concept, and applying it in practice requires a reasonable implementation. Unlike the construction schemes of dvOBP and infvoc-LDA, ivLDA-Perp uses the uniform distribution as the base distribution of the Dirichlet process and assigns weights to the words in the topic-word distribution in a more reasonable way. ivLDA-Perp focuses on the accuracy of the model. Experiments show that ivLDA-Perp outperforms the state-of-the-art algorithms.

3) Based on the theory of ivLDA, a solution to the topic duplication problem is proposed; the algorithm that uses it is named ivLDA-PMI. ivLDA-PMI solves the topic duplication problem by redesigning the message update function, and it performs better than infvoc-LDA, which also proposes a solution to this problem. Experiments show that, compared with the state-of-the-art algorithms, ivLDA-PMI gives better topic representations and has more practical value.
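The incremental-vocabulary idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's ivLDA construction: it assumes a simple Chinese-restaurant-style predictive rule in which a word already seen in a topic gets probability count / (total + concentration), and all not-yet-seen words together get concentration / (total + concentration). The class name `GrowingTopicWordCounts`, its methods, and the `concentration` parameter are hypothetical names introduced only for this illustration.

```python
class GrowingTopicWordCounts:
    """Topic-word counts whose vocabulary grows as new words arrive.

    Illustrative only: a simplified stand-in, not the ivLDA algorithm.
    """

    def __init__(self, num_topics, concentration=1.0):
        self.concentration = concentration  # assumed DP concentration parameter
        self.word_to_id = {}                # vocabulary, initially empty
        self.counts = [[] for _ in range(num_topics)]

    def _word_id(self, word):
        # Add an unseen word to the vocabulary and widen every topic's row.
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.word_to_id)
            for row in self.counts:
                row.append(0.0)
        return self.word_to_id[word]

    def observe(self, topic, word, weight=1.0):
        # Record that `word` was assigned to `topic`.
        self.counts[topic][self._word_id(word)] += weight

    def seen_word_probability(self, topic, word):
        # Predictive probability of a word already in the vocabulary.
        if word not in self.word_to_id:
            return 0.0
        row = self.counts[topic]
        return row[self.word_to_id[word]] / (sum(row) + self.concentration)

    def new_word_mass(self, topic):
        # Probability mass reserved for words not yet in the vocabulary.
        return self.concentration / (sum(self.counts[topic]) + self.concentration)


model = GrowingTopicWordCounts(num_topics=2)
for topic, word in [(0, "apple"), (0, "apple"), (0, "pear"), (1, "kiwi")]:
    model.observe(topic, word)

print(len(model.word_to_id))                    # 3
print(model.seen_word_probability(0, "apple"))  # 0.5
print(model.new_word_mass(0))                   # 0.25
```

No vocabulary is declared up front. Each unseen word extends every topic's count row, and the reserved mass for new words shrinks as a topic accumulates observations.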
Keywords/Search Tags:Topic model, Latent Dirichlet Allocation, Belief propagation, Dirichlet process, Topic duplication