
The Research And Implementation About Parallel Latent Dirichlet Allocation

Posted on: 2016-12-05    Degree: Master    Type: Thesis
Country: China    Candidate: Z L Qiu    Full Text: PDF
GTID: 2298330467992891    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology and the Internet, the amount of data generated every day is growing explosively, and effective data processing has become an urgent need in the era of big data. Spark is a fast, in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the big data community, and it is well suited to iterative and interactive algorithms. This thesis makes the following three contributions to the parallelization of the LDA topic model.

First, we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Our approach splits the dataset into P*P partitions, then shuffles and recombines these partitions into P sub-datasets according to rules that avoid sampling conflicts, so that each sub-dataset contains exactly P partitions; the sub-datasets are then processed one after another, with the P partitions inside each sub-dataset sampled in parallel (a sketch of this scheme is given below). Although this increases the number of iterations, it reduces data communication overhead and makes good use of Spark's efficient iterative execution.

Second, we evaluate the algorithm in terms of perplexity, convergence speed, and speedup. The results show that the algorithm preserves accuracy and achieves good acceleration in large-scale data environments.

Third, we describe BC-PDM, a parallel data mining tool that integrates many commonly used data mining algorithms on Hadoop and Spark, including the LDA algorithm presented in this thesis, and we walk through the complete process of building a topic model with the LDA algorithm in BC-PDM.
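For context, the collapsed Gibbs sampler referred to above draws each token's topic from the standard LDA conditional; this is the textbook update, not a formula quoted from the thesis:

$$
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{-i}_{k,w_i} + \beta}{n^{-i}_{k} + V\beta}\,\bigl(n^{-i}_{d_i,k} + \alpha\bigr),
$$

where $n^{-i}_{k,w_i}$ is the number of times word $w_i$ is assigned to topic $k$ excluding token $i$, $n^{-i}_{k}$ is the total number of tokens assigned to topic $k$, $n^{-i}_{d_i,k}$ is the count of topic $k$ in document $d_i$, $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet hyperparameters.

The following Scala/Spark sketch illustrates one way the P*P partition-and-shuffle scheme could be organized: tokens are keyed by (document block, word block), and each sub-epoch activates P "diagonal" cells that share no document block and no word block, so they can be sampled in parallel without conflicting count updates. This is a minimal illustration under assumed names (`Token`, `sampleTopic`, `P`), not the thesis implementation; the actual count bookkeeping of the Gibbs update is omitted.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical token record: one word occurrence with its current topic assignment.
case class Token(docId: Long, wordId: Int, topic: Int)

object DiagonalLdaSketch {
  val P = 4  // number of blocks per dimension (illustrative value)

  // Placeholder for the collapsed Gibbs update; a real sampler would draw from
  // the conditional above using document-topic and topic-word counts kept
  // inside the partition.
  def sampleTopic(t: Token): Int = t.topic

  // One full sweep over the corpus, organized as P conflict-free sub-epochs.
  def sweep(tokens: RDD[Token]): RDD[Token] = {
    // Key every token by its cell in the P x P grid: (document block, word block).
    var keyed = tokens.keyBy(t => ((t.docId % P).toInt, t.wordId % P))
    for (s <- 0 until P) {
      // In sub-epoch s only the cells (i, (i + s) mod P) are active; these P
      // cells touch disjoint document and word blocks, so they can be sampled
      // in parallel without conflicting counter updates.
      keyed = keyed
        .partitionBy(new HashPartitioner(P))
        .mapPartitions(_.map { case (cell @ (d, w), t) =>
          if (w == (d + s) % P) (cell, t.copy(topic = sampleTopic(t)))
          else (cell, t)
        }, preservesPartitioning = true)
    }
    keyed.values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lda-partition-sketch")
      .master("local[*]").getOrCreate()
    val toy = spark.sparkContext.parallelize(Seq(
      Token(0L, 1, 0), Token(1L, 5, 1), Token(2L, 9, 2)))
    sweep(toy).collect().foreach(println)
    spark.stop()
  }
}
```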
Keywords/Search Tags: Big Data, Spark, Gibbs Sampling, LDA