
Research Of Parallel LDA Topic Model Based On Spark

Posted on: 2017-05-13  Degree: Master  Type: Thesis
Country: China  Candidate: J Xiao  Full Text: PDF
GTID: 2348330503965914  Subject: Engineering
Abstract/Summary:
With the rapid development of information technology and the Internet, people can obtain more and more information, and the scale of data has grown dramatically, from gigabytes initially to terabytes or even petabytes. Although this data has great potential value, its sheer size makes it harder to process, so obtaining useful information quickly and efficiently has become a problem that needs to be solved.

Latent Dirichlet Allocation (LDA) is a topic model for text processing that maps documents to a low-dimensional topic space in order to analyze them. AD-LDA (Approximate Distributed LDA) is a parallel LDA algorithm based on Gibbs sampling that adopts the idea of global synchronization: after each iteration, the local results are merged to obtain the global model parameters. Because the global sampling parameters cannot be updated in time during the sampling process, the final result loses a little precision compared with the standard LDA algorithm.

The main work of this thesis is summarized as follows:

(1) Based on research into topic models, an improved parallel algorithm based on AD-LDA is proposed. In the data partitioning step, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is introduced to calculate the similarity between texts. Documents with high similarity are assigned to the same data block, which weakens the interdependence between data blocks and reduces the precision loss of the parallel algorithm.

(2) To improve the ability of AD-LDA to handle huge amounts of data, the algorithm is run on a distributed framework. Spark is a memory-based distributed computing framework; it has all the advantages of Hadoop MapReduce and is better suited to data mining and machine learning algorithms that require multiple iterations, so this thesis uses Spark to implement the algorithm.

Finally, experiments on a classic data set compare the perplexity, convergence speed, and speedup of the different algorithms. The results show that the improved algorithm is closer to the standard LDA model in perplexity and convergence speed, and that it achieves good speedup in large-data environments.
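The two ideas in the abstract, grouping similar documents into the same data block and synchronizing counts globally after each iteration, can be illustrated with a small single-machine sketch. It is not the thesis's Spark implementation. The abstract does not say how documents are grouped once TF-IDF similarity is known, so the greedy seed-based strategy below is an assumption, and all function names (`tfidf_vectors`, `cosine`, `partition_by_similarity`, `merge_topic_word_counts`) are hypothetical. The merge step follows the usual AD-LDA update, in which each worker's change to the topic-word counts is added back to the global counts.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: tf-idf weight} dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def partition_by_similarity(docs, num_blocks):
    """Group document indices into blocks of similar documents.

    Assumed strategy (not taken from the thesis): pick mutually dissimilar
    seed documents, then put every other document into the block of its
    most similar seed, subject to a capacity limit that keeps blocks balanced.
    Requires len(docs) >= num_blocks.
    """
    vectors = tfidf_vectors(docs)
    n = len(docs)
    capacity = math.ceil(n / num_blocks)

    # Farthest-first seed selection, starting from document 0.
    seeds = [0]
    while len(seeds) < num_blocks:
        candidates = [i for i in range(n) if i not in seeds]
        seeds.append(
            min(
                candidates,
                key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
            )
        )

    blocks = [[s] for s in seeds]
    for i in range(n):
        if i in seeds:
            continue
        open_blocks = [b for b in range(num_blocks) if len(blocks[b]) < capacity]
        best = max(open_blocks, key=lambda b: cosine(vectors[i], vectors[seeds[b]]))
        blocks[best].append(i)
    return blocks


def merge_topic_word_counts(global_counts, local_counts):
    """AD-LDA style synchronization of topic-word counts.

    Every worker starts an iteration from a copy of global_counts and
    resamples only its own documents; the new global counts are the old
    ones plus the sum of each worker's change.
    """
    merged = [row[:] for row in global_counts]
    for local in local_counts:
        for w, row in enumerate(local):
            for k, value in enumerate(row):
                merged[w][k] += value - global_counts[w][k]
    return merged
```

In a Spark setting, each block returned by `partition_by_similarity` would correspond to one data partition, and `merge_topic_word_counts` would run on the driver after the workers finish an iteration. That mapping is an assumption about how the pieces fit together, not a description of the thesis code.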
Keywords/Search Tags: Topic Model, LDA, Parallel, Spark