
Research On Text Classification Parallelization Method Based On LDA

Posted on: 2018-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2348330542972266
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the Internet and information technology, data on the network has grown at an unprecedented rate, and unstructured data accounts for a major share of it. Research on how to mine the latent information in texts through text classification techniques is therefore of great significance. In recent years the LDA model has attracted wide attention in text mining for its strong text-representation ability, and several parallel LDA models have been proposed. On massive data, however, the communication cost and space complexity of existing parallel LDA models are too high. How to represent documents rapidly and accurately, and how to parallelize text classification on top of that representation, are the key issues in current research.

Focusing on improving the precision and efficiency of text classification, this thesis improves both the text-representation and the text-classification phases. In the text-representation phase, it proposes a new parallel LDA model to address the high communication cost of existing models: based on Zipf's law, a partitioning strategy over the model's communication frequency reduces the communication cost of parallel training, and a word-weighting algorithm based on a Gaussian function improves the training efficiency of the LDA model without loss of text-representation precision. In the text-classification phase, the thesis performs the categorization task with an SVM. To improve classification precision, it tunes the Gaussian-kernel parameter of the SVM with particle swarm optimization. In addition, by analyzing the Cascade SVM algorithm, it derives an improved parallel algorithm that raises classification efficiency; combined with Spark, the current mainstream parallel framework, the algorithm effectively improves the training efficiency of the model.

The experimental results show that the proposed parallel LDA algorithm reduces communication cost without loss of text-representation precision; that the SVM tuned with particle swarm optimization improves classification precision over an SVM with default parameters; and that, on top of the parallel LDA representation, the improved Cascade SVM algorithm improves classification efficiency at the same precision. These three comparative experiments demonstrate the effectiveness of the proposed algorithms.
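The abstract's Zipf's-law partitioning idea can be illustrated with a small sketch. The thesis does not give the exact strategy, so the following is only a plausible interpretation: since a small head of high-frequency words accounts for most token occurrences, the vocabulary is split into frequency tiers so that a distributed trainer can synchronize the small head tier often and the long tail rarely, cutting communication volume. The function name and tier scheme are illustrative, not from the thesis.

```python
from collections import Counter

def zipf_partition(word_counts, n_tiers=3):
    """Split a vocabulary into frequency tiers by cumulative token mass.

    Tier 0 holds the high-frequency head; later tiers hold the long tail.
    A distributed LDA trainer could synchronize tier 0 every iteration
    and lower tiers less often to reduce communication cost.
    """
    ranked = [w for w, _ in sorted(word_counts.items(), key=lambda kv: -kv[1])]
    total = sum(word_counts.values())
    tiers = [[] for _ in range(n_tiers)]
    cum = 0.0
    for w in ranked:
        cum += word_counts[w] / total           # cumulative frequency mass
        tier = min(int(cum * n_tiers), n_tiers - 1)
        tiers[tier].append(w)
    return tiers

counts = Counter("the the the the lda lda model spark rare".split())
tiers = zipf_partition(counts, n_tiers=2)       # head tier vs. tail tier
```

Here the word "the" carries enough mass to fill the head tier on its own, while the remaining words fall into the tail tier.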
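The Gaussian word-weighting algorithm can likewise be sketched. The thesis does not state the formula, so this is an assumed form: a Gaussian over a word's normalized frequency rank, down-weighting both very common and very rare words on the premise that mid-frequency words are most topic-discriminative. The parameters `mu` and `sigma` are illustrative defaults.

```python
import math

def gaussian_word_weight(rank, vocab_size, mu=0.5, sigma=0.2):
    """Weight a word by a Gaussian over its normalized frequency rank.

    rank 0 is the most frequent word; the bell curve centred at `mu`
    suppresses both stop-word-like heads and noisy rare tails.
    """
    x = rank / max(vocab_size - 1, 1)   # normalize rank into [0, 1]
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# with five words, the middle-ranked word receives the largest weight
weights = [gaussian_word_weight(r, 5) for r in range(5)]
```

Such weights could scale a word's contribution during LDA's sampling or inference step, which is one way a weighting scheme can speed up training without hurting representation quality.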
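Finally, tuning the SVM's Gaussian-kernel parameter with particle swarm optimization can be sketched in miniature. In the real pipeline the objective would be cross-validated classification error as a function of the RBF `gamma`; here the objective is an arbitrary callable, and all hyperparameters (swarm size, inertia `w`, acceleration constants `c1`, `c2`) are conventional illustrative values, not those used in the thesis.

```python
import random

def pso_tune_gamma(objective, bounds=(1e-3, 10.0), n_particles=10,
                   n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal 1-D particle swarm optimization for an SVM kernel parameter.

    Each particle is a candidate gamma; velocities are pulled toward the
    particle's personal best and the swarm's global best position.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)  # clamp to bounds
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest

# toy objective with a known minimum at gamma = 2.0,
# standing in for cross-validated error
best = pso_tune_gamma(lambda g: (g - 2.0) ** 2)
```

In practice the lambda would be replaced by a function that trains an SVM with the candidate `gamma` and returns its validation error.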
Keywords/Search Tags: Text Classification, LDA Model, SVM, Parallelization