
Massive Text Classification Parallelization Technology Based On Support Vector Machine

Posted on: 2017-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Ren
Full Text: PDF
GTID: 2308330503458929
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, the volume of data on the network is growing at an unprecedented pace. Text is one of the main forms of this data and carries rich information. Text classification is an important part of natural language processing (NLP), and much of the potential value in text data can be mined with it. Data mining routinely deals with massive data sets, so execution speed is a major concern. Parallel algorithms running on graphics processing units (GPUs) or computer clusters can improve that speed.

This paper addresses both classification accuracy and computation speed in massive text classification, and analyzes text classification and parallel computing technologies. It introduces the preprocessing steps used for Chinese text, compares different classification algorithms, and introduces optimization algorithms. GPU, Hadoop and Spark are chosen as the parallel computing platforms for analysis.

To improve classification accuracy, this paper proposes an iterative evolution algorithm for the text feature space. The algorithm repairs defects in the feature space and thereby improves classification accuracy. In addition, the particle swarm optimization (PSO) algorithm is used to tune the parameters of the SVM with an RBF kernel. To improve computation speed, parallel algorithms are implemented for the preprocessing stage and for SVMs with both linear and RBF kernels. GPU, Hadoop and Spark speed up classification and increase the scale of data that can be handled.

The experiments cover news classification and microblog sentiment classification, two datasets with different characteristics. The feature space evolution algorithm works well for news classification, where it helps distinguish easily confused categories. The accuracy of microblog sentiment classification improves when the attached comments are included and PSO is used to tune the RBF SVM.

The serial algorithms take a very long time to finish the classification, while the parallel algorithms finish in an acceptably short time. The experimental results also indicate that the proposed algorithms improve classification accuracy.
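
The abstract mentions PSO for tuning the parameters of the RBF-kernel SVM but gives no details. The following is a minimal sketch of a generic PSO search, not the thesis's implementation. It assumes the search runs over two values, log10(C) and log10(gamma), and it uses a hypothetical stand-in function `toy_error` in place of the cross-validated classification error that a real tuning run would compute. The names `pso_minimize` and `toy_error`, the bounds, and the inertia and acceleration coefficients are all illustrative assumptions.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic particle swarm minimizer (illustrative sketch).

    objective: function mapping a list of floats to a float to minimize.
    bounds:    list of (low, high) pairs, one per dimension.
    w, c1, c2: inertia and acceleration coefficients (assumed values).
    """
    rng = random.Random(seed)
    dim = len(bounds)

    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Personal bests and the global best seen so far.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull each particle toward its own best and the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_error(params):
    """Hypothetical stand-in for cross-validated SVM error.

    params = [log10(C), log10(gamma)]; the minimum is placed
    arbitrarily at log10(C) = 1, log10(gamma) = -2.
    """
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


if __name__ == "__main__":
    search_box = [(-3.0, 3.0), (-5.0, 1.0)]  # assumed log10 ranges
    best, err = pso_minimize(toy_error, search_box)
    print("best log10(C), log10(gamma):", best)
    print("stand-in error:", err)
```

In an actual tuning run, `toy_error` would be replaced by a function that trains an RBF SVM with C = 10**log_c and gamma = 10**log_gamma and returns its validation error. Searching in log space is a common choice because useful values of C and gamma typically span several orders of magnitude.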
Keywords/Search Tags: text classification, support vector machine, parallel computing, feature space