
Research And Implementation On Large-Scale Distributed LDA Topic Model

Posted on: 2019-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhou
Full Text: PDF
GTID: 2428330572957274
Subject: Computer Science and Technology
Abstract/Summary:
Topic models are probabilistic generative models that can automatically extract latent semantic topics from large-scale corpora. Because of their good interpretability, topic models have gradually become an important research subject in machine learning, natural language processing and computer vision, and they are widely used in text clustering, hot-topic mining, sentiment analysis, information retrieval, recommendation systems and other fields. LDA (Latent Dirichlet Allocation) is the most widely used of these models. However, with the explosive growth of network data, the corpora to be analyzed and the number of topics to be extracted keep growing, and training topic models accurately and efficiently on such large corpora has become a major problem in both academia and industry. In this context, this thesis studies and implements an improved single-machine Gibbs sampling algorithm and parallel algorithms for large-scale distributed LDA topic models.

To address the high per-token complexity and slow convergence of Gibbs sampling for the LDA topic model, and after a thorough analysis of the sampling schemes and shortcomings of SparseLDA, AliasLDA and LightLDA, the thesis proposes ZenLDA, a new decomposition of the collapsed Gibbs sampling formula. During sampling, an alias table and a cumulative distribution are constructed for each word, which reduces the computational complexity of Gibbs sampling to O(K_d) and the sampling complexity to O(log K_d), where K_d is the number of topics that actually occur in document d; both complexities are significantly better than those of SparseLDA, AliasLDA and LightLDA. Experimental results show that ZenLDA significantly improves the convergence speed of the model while preserving the quality of model learning. (The two sampling primitives are sketched below.)

Large-scale LDA topic models also face severe computational performance problems, so distributed, parallelized LDA algorithms must be studied. Existing parallel LDA algorithms suffer from three problems: 1) every sampling machine keeps a complete "word-topic" matrix in memory at all times, so the memory configuration of the machines severely limits the scale of the LDA model that can be trained; 2) a large amount of inter-machine communication is required after each iteration; and 3) after each iteration, a large amount of time is spent synchronizing the global model parameters, which slows the convergence of the LDA model. To address these problems, the thesis designs MPI-ZenLDA, a distributed pipelined parallel LDA algorithm based on MPI. The algorithm first divides the machines into sampling-oriented computing machines and communication-oriented communication machines, which avoids synchronizing the global model after every iteration and reduces communication cost and memory consumption. It then uses a carefully designed distributed pipeline so that communication time is hidden within sampling time, greatly increasing the speed of each iteration (see the pipeline sketch below). At the same time, the improved ZenLDA sampler replaces the original standard collapsed Gibbs sampler, further speeding up each sampling iteration. Experimental results show that MPI-ZenLDA is significantly better than the existing AD-LDA algorithm in both model convergence speed and speedup.
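The abstract does not spell out the exact ZenLDA decomposition, but the two primitives it cites, an alias table for the dense part of the sampling distribution (O(1) per draw after O(K) construction) and a cumulative distribution searched by bisection in O(log K_d) for the sparse document-specific part, can be illustrated generically. The following is a minimal Python sketch of those standard techniques, not the thesis code; all names are illustrative assumptions.

```python
import bisect
import random

def build_alias_table(weights):
    """Walker's alias method: O(K) construction, O(1) per sample.
    `weights` are possibly unnormalized probabilities."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    prob, alias = [0.0] * k, [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:          # leftovers are numerically ~1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw a topic index in O(1) from a prepared alias table."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

def cdf_draw(weights):
    """Build cumulative sums in O(K_d), then draw in O(log K_d)."""
    cdf, acc = [], 0.0
    for w in weights:
        acc += w
        cdf.append(acc)
    u = random.random() * acc
    return bisect.bisect_left(cdf, u)
```

In a real sampler the alias table for each word is built once and reused across many draws, which is what amortizes its construction cost and yields the overall per-token complexities quoted above.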
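The pipelining idea in MPI-ZenLDA, overlapping the model exchange for one data block with the sampling of the next, can be sketched with mpi4py. This is an illustrative sketch under assumed names, not the thesis implementation: rank 0 stands in for a communication machine, sample_block is a placeholder for the ZenLDA sampler, and the model is a plain dict of counts.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
COMM_RANK = 0      # assumption: rank 0 acts as the communication machine
N_BLOCKS = 4       # illustrative number of local data blocks

def sample_block(block_id, model):
    """Placeholder for the ZenLDA sampler; returns fake count deltas."""
    return {(f"word{block_id}", block_id % 8): 1}

if rank == COMM_RANK:
    # Communication machine: merge incoming deltas, return the merged model.
    model, status = {}, MPI.Status()
    for _ in range(N_BLOCKS * (size - 1)):
        delta = comm.recv(source=MPI.ANY_SOURCE, tag=7, status=status)
        for key, c in delta.items():
            model[key] = model.get(key, 0) + c
        comm.send(model, dest=status.Get_source(), tag=8)
else:
    # Sampling machine: while block b is being sampled, the exchange for
    # block b-1 is still in flight, so communication hides behind sampling.
    model, send_req, recv_req = {}, None, None
    for b in range(N_BLOCKS):
        delta = sample_block(b, model)      # compute with a model that is
        if send_req is not None:            # at most one block stale
            send_req.wait()
            model = recv_req.wait()
        send_req = comm.isend(delta, dest=COMM_RANK, tag=7)
        recv_req = comm.irecv(source=COMM_RANK, tag=8)
    send_req.wait()
    recv_req.wait()
```

Run with, e.g., `mpiexec -n 4 python sketch.py`. The design point is the nonblocking isend/irecv pair: each sampling machine computes on a slightly stale model while the previous exchange completes in the background.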
Further, by analyzing the limitations of MPI-ZenLDA, the thesis finds that the algorithm suffers from low learning quality and fails to make full use of computational resources. To address this, a distributed ZenLDA sampling algorithm based on a parameter server is proposed: the Petuum-ZenLDA parallel algorithm. Petuum-ZenLDA takes full advantage of the bounded asynchronous parallel framework of the parameter server and the SSP (Stale Synchronous Parallel) consistency model to solve the problem of low model learning quality (the SSP rule is sketched below). In addition, Petuum-ZenLDA adopts streaming reads and writes of data blocks and a model-slicing strategy on the sampling machines, so that large-scale topic models can be trained even when the number of machines is small. Experiments show that Petuum-ZenLDA outperforms both MPI-ZenLDA and LightLDA in model learning quality, convergence speed and speedup.
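The SSP idea named above (fast workers may run ahead of slow ones, but only by a bounded number of clocks) can be illustrated independently of Petuum. Below is a minimal thread-based sketch of the bounded-staleness rule; the class and all names are assumptions for illustration, not Petuum's API.

```python
import threading

class SSPClock:
    """Bounded-staleness barrier: a worker at clock c may proceed only if
    the slowest worker has reached at least c - staleness."""
    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def tick(self, worker_id):
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # Block while this worker is too far ahead of the slowest one.
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()

def worker(clock, wid, iterations):
    for _ in range(iterations):
        # ... sample one data block against a possibly stale model slice ...
        clock.tick(wid)        # may block until stragglers catch up

clock = SSPClock(n_workers=4, staleness=2)
threads = [threading.Thread(target=worker, args=(clock, w, 10))
           for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With staleness 0 this degenerates to bulk-synchronous iteration; a positive bound lets sampling proceed on slightly stale counts, which is what allows Petuum-ZenLDA to keep machines busy without losing convergence guarantees.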
Keywords/Search Tags: Collapsed Gibbs Sampling, Pipeline Technology, Parameter Server, Topic Model, Parallelization