With the rapid development of the Internet, the amount of information available to people has grown to an unprecedented scale, and dealing effectively with this information explosion has become a pressing issue. Text remains a traditional medium of information, so managing and mining it efficiently is essential. Automatic classification of text by computer makes text information easier to manage and has therefore received extensive attention. As the volume of Chinese-language information keeps growing, the management of Chinese text, and in particular the design and implementation of a practical Chinese text categorization system, has become a meaningful research direction.

Researchers at home and abroad have mainly approached text classification with traditional discriminative algorithms such as artificial neural networks, the nearest neighbor algorithm, Naïve Bayes, decision trees, and centroid-based classification. These traditional algorithms, however, tend to over-fit and are not sufficiently robust. The Support Vector Machine (SVM) differs from them in that its training considers not only the classifier's empirical error but also the confidence term of that error, that is, the generalization error, as the evaluation criterion. This allows SVM to estimate the separating hyperplane accurately from small samples and to avoid serious over-fitting. SVM also uses the idea of kernel functions: the original data are mapped into a high-dimensional vector space to handle problems that are not linearly separable in the original data space, which further improves the robustness of the classifier.
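As a brief illustration of the two ideas above (balancing empirical error against a complexity term, and using a kernel), the standard soft-margin SVM can be written as follows. The notation is the usual textbook one and is an assumption here, not taken from this paper: $w$ and $b$ define the hyperplane, $\xi_i$ are slack variables, $C$ is the trade-off parameter, $\phi$ is the feature map, and $K$ is the kernel.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\quad i = 1,\dots,n, \\[4pt]
f(x) &= \operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\, y_i\, K(x_i, x) + b\Bigr),
\qquad K(x, z) = \phi(x)^{\top}\phi(z).
\end{aligned}
```

The term $\tfrac{1}{2}\lVert w\rVert^{2}$ controls the margin (model complexity), while $C\sum_i \xi_i$ penalizes training errors; the kernel $K$ lets the classifier work in the high-dimensional space without computing $\phi$ explicitly.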
Strictly speaking, SVM can be regarded as a special kind of artificial neural network algorithm; because it is based on the structural risk minimization principle of statistical learning theory, it improves on traditional artificial neural networks as well as on other classification algorithms.

Following a modular design, this paper designs and implements an SVM-based Chinese text classification system on the Visual Studio 2005 platform. The system is divided into two main blocks: text preprocessing and SVM classification. The text preprocessing module consists of two sub-modules, text segmentation and text representation. The text segmentation sub-module uses a recent two-way matching Chinese word segmentation algorithm with disambiguation to achieve better segmentation, and the text representation sub-module uses a vector representation model to give an effective mathematical representation of the text. The SVM classification module consists of two sub-modules, SVM training and SVM testing. The training sub-module trains a classifier on the preprocessed Chinese text; the testing sub-module applies the trained classifier to the Chinese text to be classified, thereby achieving automatic text categorization.

This paper describes the design ideas and the implementation of each module in detail, and tests and evaluates the system from three angles: different segmentation algorithms, different classification strategies, and different classification algorithms. The tests use the Chinese corpus compiled by Dr. Li Ronglu of Fudan University, which contains more than 2,000 Chinese texts in 10 categories: economics, environment, medicine, politics, sports, art, computer, military, education, and transport.
Based on the experimental analysis, we selected the best-performing classification strategy as the default setting of the system. The experimental results demonstrate the practicality and effectiveness of the system.
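To make the two-way matching segmentation step more concrete, here is a minimal Python sketch of bidirectional maximum matching. It is an illustration under stated assumptions, not the system's actual implementation: the function names, the toy dictionary `TOY_VOCAB`, and the tie-breaking rule (prefer fewer words, then fewer single-character words, then the backward result) are hypothetical choices based on a common heuristic, and the paper's own disambiguation method may differ.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation: take the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def count_single_chars(words):
    """Number of one-character tokens in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def bidirectional_match(text, vocab):
    """Run both directions and pick one result with a simple heuristic
    (assumed here): fewer words first, then fewer single-character words,
    otherwise the backward result."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    if count_single_chars(fwd) < count_single_chars(bwd):
        return fwd
    return bwd


# Hypothetical toy dictionary, for illustration only.
TOY_VOCAB = {"研究", "研究生", "生命", "的", "起源"}
```

On the toy input `"研究生命的起源"`, forward matching greedily takes `研究生` and leaves a stray `命`, while backward matching yields `研究 / 生命 / 的 / 起源`; the heuristic prefers the latter because it contains fewer single-character tokens.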