
Comparison And Improvement Studies Of Topic Model LDA Inference Algorithms

Posted on: 2018-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zhang
Full Text: PDF
GTID: 2348330542965248
Subject: Computer Science and Technology
Abstract:
As a popular topic model, Latent Dirichlet Allocation (LDA) clusters documents and words at the topic layer, decomposing the high-dimensional, sparse document-word matrix into two relatively dense matrices: a document-topic matrix and a topic-word matrix. Since David Blei proposed LDA in 2003, three inference algorithms have been developed for it: Gibbs Sampling (GS), Variational Bayesian inference (VB), and Expectation Maximization (EM). Many variants of these three algorithms have emerged for different application contexts, such as batch LDA algorithms for small data, online LDA algorithms for big data, and accelerating algorithms for real-time processing. However, three problems remain open, and this thesis studies each of them:

1) A comparative study of the predictive abilities of the three LDA inference algorithms; the practical question is which inference algorithm to choose. Under the framework of entropy, this thesis re-examines the optimization objective of LDA, the optimization objectives of the three inference algorithms, and perplexity, a standard metric for LDA's predictive ability. The analysis finds that, compared with GS and VB, EM achieves better predictive perplexity because it directly minimizes the cross entropy between the observed word distribution and LDA's predictive word distribution (see the first code sketch after this abstract).

2) A study of how LDA's priors (the Dirichlet hyper-parameters and the number of topics) affect its predictive ability; the practical question is how to set the priors. Analyzing the priors from the perspective of entropy, the thesis finds that adjusting the Dirichlet hyper-parameters and the number of topics changes the entropy of the word distribution predicted by LDA, and thereby the model's predictive ability. Based on the observed rules of this influence, the thesis proposes a grid-search-based algorithm for finding next-best hyper-parameter values (see the second sketch below).

3) A convergence-speed study of LDA accelerating algorithms; the practical question is which accelerating algorithm converges fastest. To overcome the drawbacks of FEM, this thesis proposes a new EM-based accelerating algorithm, AEM (Adaptive EM). Its core idea is to shrink, in a self-adaptive way, the set of topics updated in each document as the model converges (see the third sketch below). On multiple datasets and across different numbers of topics, AEM converges 9% to 38.5%, 4.1% to 15.5%, and 11.7% to 43% faster than the comparatively advanced FEM, AliasLDA, and SparseLDA, respectively.
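The following sketch illustrates the entropy view in point 1: perplexity is the exponentiated cross entropy between the observed word distribution and LDA's predictive word distribution. It is a minimal illustration, not the thesis's implementation; the function name and the random matrices in the usage example are hypothetical.

```python
import numpy as np

def lda_perplexity(counts, theta, phi, eps=1e-12):
    """Perplexity as exponentiated cross entropy.

    counts : (D, V) observed document-word count matrix
    theta  : (D, K) document-topic proportions, rows sum to 1
    phi    : (K, V) topic-word distributions, rows sum to 1
    """
    # LDA's predictive word distribution per document:
    # p(w | d) = sum_k theta[d, k] * phi[k, w]
    pred = theta @ phi                                   # (D, V)
    # Cross entropy between the observed and predicted word
    # distributions, averaged over all observed tokens
    cross_entropy = -(counts * np.log(pred + eps)).sum() / counts.sum()
    return np.exp(cross_entropy)

# Hypothetical usage with random, properly normalized matrices
rng = np.random.default_rng(0)
D, K, V = 4, 3, 10
theta = rng.dirichlet(np.ones(K), size=D)
phi = rng.dirichlet(np.ones(V), size=K)
counts = rng.integers(1, 5, size=(D, V)).astype(float)
print(lda_perplexity(counts, theta, phi))
```

Since perplexity is a monotone function of this cross entropy, an algorithm that directly minimizes the cross entropy, as the thesis argues EM does, also minimizes predictive perplexity.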
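For point 2, the sketch below shows a plain exhaustive grid search over the Dirichlet hyper-parameters and the number of topics, scored by held-out perplexity. It uses scikit-learn's variational-Bayes LDA rather than the thesis's EM implementation, and it does not reproduce the thesis's "next-best" search rule; the function name and the grids are hypothetical.

```python
import numpy as np
from itertools import product
from sklearn.decomposition import LatentDirichletAllocation

def grid_search_priors(X_train, X_test, n_topics_grid, alpha_grid, beta_grid):
    """Return the (K, alpha, beta) combination with the lowest
    held-out perplexity, together with that perplexity."""
    best_combo, best_ppl = None, np.inf
    for K, alpha, beta in product(n_topics_grid, alpha_grid, beta_grid):
        lda = LatentDirichletAllocation(
            n_components=K,
            doc_topic_prior=alpha,     # Dirichlet prior on document-topic mix
            topic_word_prior=beta,     # Dirichlet prior on topic-word dists
            learning_method="batch",
            max_iter=20,
            random_state=0,
        )
        lda.fit(X_train)
        ppl = lda.perplexity(X_test)   # held-out predictive perplexity
        if ppl < best_ppl:
            best_combo, best_ppl = (K, alpha, beta), ppl
    return best_combo, best_ppl

# Hypothetical usage on random count data
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 200))
print(grid_search_priors(X[:50], X[50:], [5, 10], [0.1, 0.5], [0.01, 0.1]))
```

Such a search probes empirically what the thesis derives through entropy: the priors reshape the entropy of the predicted word distribution, which shows up directly in held-out perplexity.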
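Point 3 states AEM's core idea only at a high level: adaptively shrink the set of topics updated per document as the model converges. The sketch below is one possible reading of that idea for a single document's EM update, assuming dense NumPy arrays; the pruning threshold `tol`, the function name, and the exact update rule are assumptions, not the thesis's algorithm.

```python
import numpy as np

def aem_document_update(n_dw, phi, theta_d, active, tol=1e-3):
    """One EM-style update for a single document, restricted to its
    currently active topics; topics with negligible mass are pruned.

    n_dw    : (V,)   word counts of the document
    phi     : (K, V) topic-word distributions
    theta_d : (K,)   the document's topic proportions
    active  : index array of topics still updated for this document
    """
    # E-step over active topics only: r[k, w] ∝ theta_d[k] * phi[k, w]
    r = theta_d[active][:, None] * phi[active, :]        # (|active|, V)
    r /= r.sum(axis=0, keepdims=True) + 1e-12
    # M-step for this document: expected token counts per active topic
    expected = r @ n_dw                                   # (|active|,)
    theta_d = np.zeros_like(theta_d)
    theta_d[active] = expected / expected.sum()
    # Adaptive pruning: as training converges, more topics fall below
    # tol and are excluded from later updates, saving computation
    return theta_d, active[theta_d[active] > tol]
```

Intuitively, the saving grows across iterations because most documents concentrate their mass on a few topics as the model converges, so each per-document update touches ever fewer topics.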
Keywords/Search Tags:Latent Dirichlet Allocation, Inference Algorithms, Priors, Convergence Speed