
An Improved AD-LDA Distributed Topic Model Based On Weighted Gibbs Sampling

Posted on: 2017-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Liang
Full Text: PDF
GTID: 2348330566456728
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology and the Internet, the amount of data being generated has grown dramatically. Large-scale data now exceed the capacity of existing computing technology and information systems, so there is an urgent need for effective techniques for mining large data sets. LDA is the main topic model in the field of text data mining, and AD-LDA is a parallel implementation of LDA for distributed platforms. To address the low efficiency of the AD-LDA model on distributed platforms, this paper proposes WAD-LDA, a model based on AD-LDA that is combined with weighted sampling and executed on the Spark distributed platform. WAD-LDA reduces the Gibbs sampling time of the AD-LDA model on a distributed platform. To calculate the impact factor (IF) of each feature word accurately, we use the TF-IDF statistical method to compute word weights, which reduces the influence of high-frequency words, extracts feature words accurately, and controls the number of words sampled. This shortens each iteration and improves the efficiency of AD-LDA while preserving the accuracy of the model. The experiments use a real-world dataset, a subset of the network access logs of Beijing Institute of Technology. For comparison, we choose two state-of-the-art models, Spark-LDA and AD-LDA, as baselines. The experiments show that the improved algorithm increases efficiency without losing much precision, and that WAD-LDA performs better than AD-LDA.
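The abstract does not give the exact formula for the impact factor or the rule used to limit the sampled words, so the following is only a minimal sketch of the general idea, assuming plain TF-IDF (term frequency times log inverse document frequency) and a hypothetical `keep_ratio` cutoff. The function names `tfidf_weights` and `select_sampling_words` are illustrative and do not come from the thesis.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def select_sampling_words(doc_weights, keep_ratio=0.5):
    """Keep the top-weighted fraction of a document's distinct words.

    The keep_ratio cutoff is an assumption made for illustration only.
    """
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(doc_weights, key=lambda w: (-doc_weights[w], w))
    keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return set(ranked[:keep])


docs = [
    ["the", "spark", "cluster", "the"],
    ["the", "topic", "model"],
    ["the", "gibbs", "sampler", "topic"],
]
weights = tfidf_weights(docs)
selected = [select_sampling_words(w) for w in weights]
```

In this toy corpus the word "the" occurs in every document, so its inverse document frequency is zero and it is never selected. A Gibbs sampler restricted to the selected words would visit fewer tokens per iteration, which is the kind of saving the abstract describes.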
Keywords/Search Tags:Gibbs sampling, AD-LDA, Spark, Weighted computing