Research On Large-scale Text Classification Based On SVM

Posted on:2008-04-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Yuan

Full Text:PDF

GTID:2178360215972133

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of communication, computing, and especially the Internet, all kinds of information have grown geometrically, and so has text as an information carrier. To pick out valid information from massive and complicated text in a timely and accurate way, text representation and automatic text categorization technology have received widespread attention. Text categorization greatly helps the effectiveness and efficiency of information retrieval, which promotes personalized services and improves the way information is acquired. Good classification performance is therefore the central concern, and text categorization algorithms based on SVM are an increasingly active research focus.

First, the dissertation analyzes the overall model for text classification, text representation, and the key technologies for text categorization. For feature extraction, several feature selection methods are compared: document frequency, the chi-square (CHI) statistic, information gain (IG), and mutual information. The comparison shows that feature selection based on IG performs better than the other methods. For text representation, the vector space model is implemented using TF-IDF. Among multi-class classification algorithms, the one-versus-others algorithm is used, and the results are quite satisfactory.

The dissertation focuses on statistical learning theory, examines the support vector machine algorithm built on it, and describes the current status of research on and applications of support vector machines, as well as the problems they face. Furthermore, the author analyzes and discusses the training and classification algorithms of SVM, along with hot issues such as algorithms for solving large-scale problems. The dissertation proposes a parallel SVM classification algorithm, PCSMO-KNN, coupled with the SVMQP idea, to cope with the bottlenecks of computation time and memory in massive and disordered text classification problems.
The algorithm distributes the massive text collection across many parallel processors, trains on each part with the CSMO algorithm, and then weights the support vector (SV) sets in feature space using KNN. It makes full use of the advantages of combined classifiers to achieve a better trade-off between training speed and precision. Experiments show that the algorithm greatly improves the training speed and accuracy of massive text classification and effectively relieves the bottleneck that arises when there are many SVs.

In addition, after studying the key technologies for text classification and SVM variant algorithms, the author describes the design of a Chinese text classification system based on the improved algorithm, and the system is evaluated experimentally under certain conditions. Finally, a better classification result is achieved by training and testing the classifier on training and test sets, and the system solves, to a certain extent, the bottleneck problems of massive text classification based on SVM.
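
The feature selection and text representation steps mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which IG formula or TF-IDF weighting the thesis uses, so the code below assumes a common textbook form: IG computed from the presence or absence of a term, and TF-IDF as raw term frequency times log(N / df). The function names and the toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t present) - P(not t) H(C | t absent)."""
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    gain = entropy(list(Counter(labels).values()))
    for part in (present, absent):
        size = sum(part.values())
        if size:  # skip an empty split to avoid dividing by zero
            gain -= (size / n) * entropy(list(part.values()))
    return gain

def select_features(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]

def tfidf_vectors(docs, features):
    """Assumed weighting: raw term count times log(N / document frequency)."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in features}
    return [
        [d.count(t) * math.log(n / df[t]) if df[t] else 0.0 for t in features]
        for d in docs
    ]

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "win"],
    ["market", "stock", "price"],
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=4)
vectors = tfidf_vectors(docs, features)
```

In this toy corpus, "win" appears once in each class and therefore carries no information about the class, so it is ranked below terms such as "goal" or "stock" that occur in only one class. The resulting vectors could then be passed to any one-versus-others SVM trainer.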

Keywords/Search Tags:

Support vector machine, KNN algorithm, parallel technologies, weighted regression, text classification

Related items

1	The Research Of Distributed Parallel Support Vector Regression Machine Algorithm And Framework
2	Massive Text Classification Parallelization Technology Based On Support Vector Machine
3	Research On Multi-hyperplane Twin Support Vector Regression Algorithm And Its Optimization
4	Research On Parallel Text Classification Method Based On Support Vector Machine
5	Research On Support Vector Regression Algorithms And Its Application
6	Research On Engineering Applications Of Support Vector Machine
7	Research On Text Classification Algorithm Based On Support Vector Machine And Neural Network
8	Research On Twin Support Vector Machine Classification And Regression Algorithm
9	Research And Application Of Heterogeneous Weighted Support Vector Machine Algorithm
10	The Study Of Text Classification Based On Support Vector Machine