| With the rapid development of communication, computing, and especially the Internet, the volume of information of all kinds has grown geometrically, and so has text as an information carrier. To extract valid information from massive and complicated text in a timely and accurate way, text representation and automatic text categorization have received widespread attention. Text categorization improves the effectiveness and efficiency of information retrieval, which in turn supports personalized services and improves the way information is acquired. Good classification performance is therefore the central concern, and text categorization algorithms based on support vector machines (SVM) have become an increasingly active research focus. First, the dissertation analyzes the overall model of text classification, text representation, and the key technologies for text categorization. For feature extraction, several feature selection methods are compared, including document frequency, the CHI (chi-square) statistic, information gain (IG), and mutual information. The comparison shows that feature selection based on IG performs better than the other methods. For text representation, the vector space model is implemented using TF-IDF weighting. Among multi-class classification algorithms, the one-versus-others algorithm is used, and the results are quite satisfactory. The dissertation then focuses on statistical learning theory, examines the support vector machine algorithm built on it, and reviews the current state of SVM research and applications as well as the problems they face. Furthermore, the author analyzes and discusses SVM training and classification algorithms, along with active topics such as algorithms for solving large-scale problems. The dissertation proposes a parallel SVM classification algorithm, PCSMO-KNN, combined with the SVMQP idea, to cope with the bottlenecks of computation time and memory in massive and disordered text classification problems. 
The algorithm distributes the massive text collection across many parallel processors, trains each partition with the CSMO algorithm, and then weights the resulting support vector (SV) sets in feature space using KNN. It exploits the advantages of combined classifiers to strike a better balance between training speed and precision. Experiments show that the algorithm greatly improves the training speed and accuracy of massive text classification and effectively relieves the bottleneck that arises when the number of SVs is large. In addition, after studying the key technologies of text classification and variants of the SVM algorithm, the author describes the design of a Chinese text classification system based on the improved algorithm, and the system is evaluated experimentally under certain conditions. Finally, training and testing the classifier on training and test sets yields a better classification result, and the system resolves, to a certain extent, the bottleneck problems of massive text classification based on SVM. |
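
The abstract names information gain (IG) as the best-performing feature selection criterion but does not spell out how it is computed. The following is a minimal sketch of the standard textbook definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), and is not taken from the dissertation itself. The function names (`entropy`, `information_gain`, `top_terms`) and the toy corpus are hypothetical and exist only for illustration; documents are assumed to be already tokenized into sets of terms.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus class entropy conditioned on
    whether the term is present in a document."""
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (with_term if term in tokens else without_term)[label] += 1

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional


def top_terms(docs, labels, k):
    """Rank every term in the vocabulary by IG and keep the k best."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-information_gain(docs, labels, t), t),  # ties broken alphabetically
    )
    return ranked[:k]


# Toy corpus (illustrative data only): each document is a set of tokens.
docs = [
    {"svm", "kernel"},
    {"svm", "margin"},
    {"goal", "match"},
    {"match", "team"},
]
labels = ["ml", "ml", "sport", "sport"]
```

On this toy corpus, `top_terms(docs, labels, 2)` keeps the two terms that separate the classes perfectly, while a term that appears in only one document receives a lower score. A real pipeline would apply the same ranking to a much larger vocabulary before building TF-IDF vectors for the classifier.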