A Fast Parallel Topic Modeling Algorithm In Shared Memory System

Posted on: 2016-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Yang
Full Text: PDF
GTID: 2308330464953267
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the demand for timeliness in large-scale data processing keeps growing. Parallel topic modeling has been recognized as one of the most effective methods for dealing with unstructured data. Latent Dirichlet Allocation (LDA) is one of the most popular probabilistic topic models; it maps documents to a low-dimensional space (the topic space) for the purpose of document analysis. However, two problems cannot be ignored when parallel LDA algorithms are used to handle large-scale data. First, the existing approximate inference algorithms, including belief propagation (BP), have many disadvantages that make them difficult to apply to large-scale data. Though BP excels in accuracy and convergence time, its large space complexity becomes the main obstacle when processing and analyzing large-scale data. Second, parallel LDA algorithms in shared memory have not been studied enough, and the locking problem among multiple threads remains unsolved. How to improve the existing approximate algorithms and reduce the locking time among multiple threads remains a challenge.

When a word is updated, the messages of related words and documents are gathered and stored, which leads to the large space complexity. In this paper, a new belief propagation algorithm (EBP) is proposed from the perspective of Expectation-Maximization, which helps to avoid the problem of large space complexity.

Traditional parallel LDA algorithms in shared memory fail to make full use of computing resources; a dynamic scheduling parallel method is proposed in this paper to solve the problem. Parallelization can be regarded as assigning work to threads. Through the dynamic scheduling of threads, the locking time among multiple threads is reduced.
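To illustrate the general idea of dynamic scheduling (threads take the next piece of work when they become idle, instead of receiving a fixed partition up front), here is a minimal Python sketch. It is not the thesis's method: the function `run_dynamic`, the shared work queue, and the toy per-document workload are assumptions made purely for illustration. Because of CPython's global interpreter lock the sketch shows only the scheduling structure, not real speedup.

```python
import queue
import threading


def run_dynamic(tasks, worker_fn, num_threads=4):
    """Process tasks with threads that pull work from a shared queue.

    Hypothetical helper: an idle thread fetches the next task itself,
    so uneven task sizes do not leave some threads waiting on others.
    """
    work = queue.Queue()
    for index, task in enumerate(tasks):
        work.put((index, task))

    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = work.get_nowait()
            except queue.Empty:
                return  # no work left, thread exits
            # Each result slot is written by exactly one thread,
            # so no extra lock is needed around this assignment.
            results[index] = worker_fn(task)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy workload: "documents" of very different lengths (assumed data).
doc_lengths = [5, 200, 3, 40, 1, 90]
totals = run_dynamic(doc_lengths, lambda n: sum(range(n)), num_threads=3)
```

With a static partition, the thread that receives the longest documents finishes last while the others sit idle; pulling from a shared queue lets the workload balance itself at the cost of one short queue operation per task.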
Combining the improved parallel algorithm with EBP, we propose a fast parallel topic modeling algorithm for shared memory systems named PEBP. Experiments show that EBP is better than BP and Gibbs sampling (GS) in perplexity and convergence time. The dynamic scheduling of threads achieves better speedup and scaleup than common scheduling schemes. PEBP also outperforms normal parallel LDA algorithms in perplexity and convergence time.
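Perplexity, the quality measure used above, is commonly computed as the exponential of the negative average per-token log-likelihood, so lower values are better. The sketch below shows that standard definition for an LDA-style model; the function name `perplexity` and the parameter names `theta` (document-topic proportions) and `phi` (topic-word probabilities) are illustrative assumptions, and the exact evaluation protocol of the thesis is not given in this abstract.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of word counts under a topic model.

    docs:  list of {word_id: count} dictionaries, one per document
    theta: theta[d][k] = probability of topic k in document d
    phi:   phi[k][w]   = probability of word w under topic k
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w, count in doc.items():
            # Word probability is a mixture over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += count * math.log(p)
            n_tokens += count
    return math.exp(-log_likelihood / n_tokens)


# Sanity check: a model that is uniform over 4 words has perplexity 4.
uniform_phi = [[0.25] * 4, [0.25] * 4]
uniform_theta = [[0.5, 0.5]]
toy_docs = [{0: 2, 3: 1}]
value = perplexity(toy_docs, uniform_theta, uniform_phi)
```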
Keywords/Search Tags:Latent Dirichlet allocation, belief propagation, shared memory, parallel, dynamic scheduling