Research Of Parallel LDA Topic Model Based On Spark

Posted on:2017-05-13

Degree:Master

Type:Thesis

Country:China

Candidate:J Xiao

Full Text:PDF

GTID:2348330503965914

Subject:Engineering

Abstract/Summary:

With the rapid development of information technology and the Internet, people can obtain more and more information, and the scale of data has increased dramatically, rising from the GB level to TB or even PB. Although this data has great potential value, its growing size makes it harder to process. How to obtain useful information quickly and efficiently has therefore become a problem to be solved.

Latent Dirichlet Allocation (LDA) is a topic model for text processing that maps documents to a low-dimensional topic space for document analysis. AD-LDA (Approximate Distributed LDA) is a parallelized LDA algorithm based on Gibbs sampling. It adopts the idea of global synchronization: after each iteration, the local results are merged to obtain the global model parameters. Because the global parameters cannot be updated in time during sampling, the final result loses a little precision compared with the standard LDA algorithm.

The main work of this paper is summarized as follows:

(1) Based on a study of topic models, this paper proposes an improved parallel algorithm built on AD-LDA. In the data partitioning step, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is introduced to calculate the similarity between texts, and documents with high similarity are assigned to the same data block. This weakens the interdependence between data blocks and reduces the precision loss of the parallel algorithm.

(2) To improve the ability of AD-LDA to handle huge amounts of data, this paper runs the algorithm on a distributed framework. Spark is a memory-based distributed computing framework that retains the advantages of Hadoop MapReduce and is better suited to data mining and machine learning algorithms that require multiple iterations, so this paper chooses Spark to implement the algorithm.

Finally, experiments on a classic data set compare the perplexity, convergence speed, and speedup of the different algorithms. The results show that the improved algorithm is closer to the standard LDA model in perplexity and convergence speed, and that it achieves a good speedup in large-data environments.

Keywords/Search Tags:

Topic Model, LDA, Parallel, Spark

Related items

1	Research On The Implementation Of Bursty Events Detection Based On Spark
2	Research And System Design Of Hot Topic Discovery Method Based On Microblog Data Flow
3	Research On Parallel Mining Algorithm Of Association Pattern Based On Spark
4	Design And Implementation Of Advertising Push System Based On Spark
5	Research On Topic Detection And Tracking Technology Based On Spark
6	Research And Application Of Hot Topic Recognition And Evolution Analysis For Mobile Complaint Text
7	Design And Implementation Of Forum Data Analysis Platform Based On SPARK
8	A Study And Implementation Of Web Text Mining System Based On Spark
9	Design And Implementation Of Parallel Data Mining System Based On Spark
10	Research And Improvement Of Big Data Parallel Clustering Algorithm Based On Spark