
Research On Text Classification Parallelization Method Based On LDA

Posted on: 2018-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2348330542972266
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the Internet and information technology, data on the network has grown at an unprecedented rate, and unstructured data accounts for a major share of it. Research on how to mine the latent information in texts through text classification techniques is therefore of great significance. In recent years the LDA model has attracted wide attention in text mining for its strong text-representation ability, and several parallel LDA models have been proposed. On massive data, however, the communication cost and space complexity of existing parallel LDA models are too high. How to represent documents rapidly and accurately, and how to parallelize text classification on top of that representation, are the key issues in current research.

Focusing on improving the precision and efficiency of text classification, this thesis improves both the text-representation and the text-classification phases. In the text-representation phase, it proposes a new parallel LDA model to address the high communication cost of existing models: based on Zipf's law, a partitioning strategy over the model's communication frequency reduces the communication cost of parallel training, and a word-weighting algorithm based on a Gaussian function improves the training efficiency of the LDA model without loss of text-representation precision. In the text-classification phase, the thesis performs the categorization task with an SVM. To improve classification precision, it tunes the Gaussian-kernel parameter of the SVM with particle swarm optimization. In addition, by analyzing the Cascade SVM algorithm, it derives an improved parallel algorithm that raises classification efficiency; combined with Spark, the current mainstream parallel framework, the algorithm effectively improves the training efficiency of the model.

The experimental results show that the proposed parallel LDA algorithm reduces communication cost without loss of text-representation precision; that the SVM tuned with particle swarm optimization improves classification precision over an SVM with default parameters; and that, on top of the parallel LDA representation, the improved Cascade SVM algorithm improves classification efficiency at the same precision. These three comparative experiments demonstrate the effectiveness of the proposed algorithms.
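The abstract's Zipf's-law partitioning idea can be illustrated with a small sketch. The thesis does not give the exact strategy, so the following is only a plausible interpretation: since a small head of high-frequency words accounts for most token occurrences, the vocabulary is split into frequency tiers so that a distributed trainer can synchronize the small head tier often and the long tail rarely, cutting communication volume. The function name and tier scheme are illustrative, not from the thesis.

```python
from collections import Counter

def zipf_partition(word_counts, n_tiers=3):
    """Split a vocabulary into frequency tiers by cumulative token mass.

    Tier 0 holds the high-frequency head; later tiers hold the long tail.
    A distributed LDA trainer could synchronize tier 0 every iteration
    and lower tiers less often to reduce communication cost.
    """
    ranked = [w for w, _ in sorted(word_counts.items(), key=lambda kv: -kv[1])]
    total = sum(word_counts.values())
    tiers = [[] for _ in range(n_tiers)]
    cum = 0.0
    for w in ranked:
        cum += word_counts[w] / total           # cumulative frequency mass
        tier = min(int(cum * n_tiers), n_tiers - 1)
        tiers[tier].append(w)
    return tiers

counts = Counter("the the the the lda lda model spark rare".split())
tiers = zipf_partition(counts, n_tiers=2)       # head tier vs. tail tier
```

Here the word "the" carries enough mass to fill the head tier on its own, while the remaining words fall into the tail tier.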
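The Gaussian word-weighting algorithm can likewise be sketched. The thesis does not state the formula, so this is an assumed form: a Gaussian over a word's normalized frequency rank, down-weighting both very common and very rare words on the premise that mid-frequency words are most topic-discriminative. The parameters `mu` and `sigma` are illustrative defaults.

```python
import math

def gaussian_word_weight(rank, vocab_size, mu=0.5, sigma=0.2):
    """Weight a word by a Gaussian over its normalized frequency rank.

    rank 0 is the most frequent word; the bell curve centred at `mu`
    suppresses both stop-word-like heads and noisy rare tails.
    """
    x = rank / max(vocab_size - 1, 1)   # normalize rank into [0, 1]
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# with five words, the middle-ranked word receives the largest weight
weights = [gaussian_word_weight(r, 5) for r in range(5)]
```

Such weights could scale a word's contribution during LDA's sampling or inference step, which is one way a weighting scheme can speed up training without hurting representation quality.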
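Finally, tuning the SVM's Gaussian-kernel parameter with particle swarm optimization can be sketched in miniature. In the real pipeline the objective would be cross-validated classification error as a function of the RBF `gamma`; here the objective is an arbitrary callable, and all hyperparameters (swarm size, inertia `w`, acceleration constants `c1`, `c2`) are conventional illustrative values, not those used in the thesis.

```python
import random

def pso_tune_gamma(objective, bounds=(1e-3, 10.0), n_particles=10,
                   n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal 1-D particle swarm optimization for an SVM kernel parameter.

    Each particle is a candidate gamma; velocities are pulled toward the
    particle's personal best and the swarm's global best position.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)  # clamp to bounds
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest

# toy objective with a known minimum at gamma = 2.0,
# standing in for cross-validated error
best = pso_tune_gamma(lambda g: (g - 2.0) ** 2)
```

In practice the lambda would be replaced by a function that trains an SVM with the candidate `gamma` and returns its validation error.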
Keywords/Search Tags: Text Classification, LDA Model, SVM, Parallelization