
The Research And Implementation About Parallel Latent Dirichlet Allocation

Posted on: 2016-12-05    Degree: Master    Type: Thesis
Country: China    Candidate: Z L Qiu    Full Text: PDF
GTID: 2298330467992891    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology and the Internet, the amount of data generated every day is growing explosively, and effective data processing has become an urgent need in the era of big data. Spark is a fast, in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the big data community, and it is well suited to iterative and interactive algorithms. This thesis makes the following three contributions to the parallelization of the LDA topic model.

First, we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Our approach splits the dataset into P*P partitions, then shuffles and recombines these partitions into P sub-datasets according to rules that avoid sampling conflicts, so that each sub-dataset contains exactly P partitions; the sub-datasets are then processed one after another, with the P partitions inside each sub-dataset sampled in parallel (a sketch of this scheme is given below). Although this increases the number of iterations, it reduces data communication overhead and makes good use of Spark's efficient iterative execution.

Second, we evaluate the algorithm in terms of perplexity, convergence speed, and speedup. The results show that the algorithm preserves accuracy and achieves good acceleration in large-scale data environments.

Third, we describe BC-PDM, a parallel data mining tool that integrates many commonly used data mining algorithms on Hadoop and Spark, including the LDA algorithm presented in this thesis, and we walk through the complete process of building a topic model with the LDA algorithm in BC-PDM.
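For context, the collapsed Gibbs sampler referred to above draws each token's topic from the standard LDA conditional; this is the textbook update, not a formula quoted from the thesis:

$$
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{-i}_{k,w_i} + \beta}{n^{-i}_{k} + V\beta}\,\bigl(n^{-i}_{d_i,k} + \alpha\bigr),
$$

where $n^{-i}_{k,w_i}$ is the number of times word $w_i$ is assigned to topic $k$ excluding token $i$, $n^{-i}_{k}$ is the total number of tokens assigned to topic $k$, $n^{-i}_{d_i,k}$ is the count of topic $k$ in document $d_i$, $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet hyperparameters.

The following Scala/Spark sketch illustrates one way the P*P partition-and-shuffle scheme could be organized: tokens are keyed by (document block, word block), and each sub-epoch activates P "diagonal" cells that share no document block and no word block, so they can be sampled in parallel without conflicting count updates. This is a minimal illustration under assumed names (`Token`, `sampleTopic`, `P`), not the thesis implementation; the actual count bookkeeping of the Gibbs update is omitted.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical token record: one word occurrence with its current topic assignment.
case class Token(docId: Long, wordId: Int, topic: Int)

object DiagonalLdaSketch {
  val P = 4  // number of blocks per dimension (illustrative value)

  // Placeholder for the collapsed Gibbs update; a real sampler would draw from
  // the conditional above using document-topic and topic-word counts kept
  // inside the partition.
  def sampleTopic(t: Token): Int = t.topic

  // One full sweep over the corpus, organized as P conflict-free sub-epochs.
  def sweep(tokens: RDD[Token]): RDD[Token] = {
    // Key every token by its cell in the P x P grid: (document block, word block).
    var keyed = tokens.keyBy(t => ((t.docId % P).toInt, t.wordId % P))
    for (s <- 0 until P) {
      // In sub-epoch s only the cells (i, (i + s) mod P) are active; these P
      // cells touch disjoint document and word blocks, so they can be sampled
      // in parallel without conflicting counter updates.
      keyed = keyed
        .partitionBy(new HashPartitioner(P))
        .mapPartitions(_.map { case (cell @ (d, w), t) =>
          if (w == (d + s) % P) (cell, t.copy(topic = sampleTopic(t)))
          else (cell, t)
        }, preservesPartitioning = true)
    }
    keyed.values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lda-partition-sketch")
      .master("local[*]").getOrCreate()
    val toy = spark.sparkContext.parallelize(Seq(
      Token(0L, 1, 0), Token(1L, 5, 1), Token(2L, 9, 2)))
    sweep(toy).collect().foreach(println)
    spark.stop()
  }
}
```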
Keywords/Search Tags: Big Data, Spark, Gibbs Sampling, LDA