
Modeling And Implementation Of Chinese Text Categorization System Based On SVM

Posted on: 2007-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: H W Ma
GTID: 2178360182460620
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of communication technology and the Internet, the volume of information is growing exponentially, and text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization (TC) has become very important, and text categorization based on the support vector machine (SVM) is an active research area. Designing an SVM-based TC platform therefore helps to advance research on Chinese text categorization.

In this thesis, a Chinese text categorization system is implemented on Microsoft Windows 2000 with Visual C++ 6.0 and MS SQL Server 2000. Tests on two real examples verify the good performance of the system.

To loosen the coupling among the different modules of the system, a database is introduced, on which many statistics and calculations are performed efficiently. Furthermore, the algorithm used in each module can be replaced, so experiments can be run to compare different algorithms.

The modeling and implementation of the system comprise four main parts: text feature extraction, vector space modeling, SVM machine learning, and SVM-based multi-class categorization. In the first part, four methods are compared: document frequency (DF), the χ² test (CHI), mutual information (MI), and information gain (IG). In the tests, the IG-based method performs best in this system. In the second part, the TF-IDF algorithm is implemented. In the SVM learning part, Matlab is used to solve the optimization problem, and the one-vs-rest method is used for multi-class categorization, which has been shown to achieve good precision and recall.

For training, about 2,000 Chinese texts in 10 classes were collected, and nearly 1,000 other texts were used to test the classifier. The training and categorization tests show good results: precision and recall are about 97.84% and 89.93%, respectively, which is superior to some traditional text classifiers such as Rocchio and KNN.
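
As an illustration of two of the steps named in the abstract, the following minimal Python sketch scores a term by information gain and builds TF-IDF vectors. It is not the thesis code (the system was written in Visual C++). The function names, the toy data, and the particular TF-IDF variant (raw term count times ln(N/df)) are assumptions made for this example, and documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def entropy(class_labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(class_labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(class_labels).values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t).

    docs   : list of token lists (assumed already word-segmented)
    labels : class label of each document
    """
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(docs)
    return (entropy(labels)
            - (len(with_t) / n) * entropy(with_t)
            - (len(without_t) / n) * entropy(without_t))


def tfidf_vectors(docs, vocabulary):
    """One vector per document; weight = term count * ln(N / df).

    This is one common TF-IDF variant, chosen here as an assumption.
    """
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[t] * math.log(n / df[t]) if df[t] else 0.0
                        for t in vocabulary])
    return vectors


# Toy corpus (hypothetical): two classes, two documents each.
docs = [["svm", "text"], ["svm", "kernel"], ["ball", "game"], ["ball", "team"]]
labels = ["tech", "tech", "sport", "sport"]

ig_svm = information_gain(docs, labels, "svm")    # term separates the classes perfectly
ig_text = information_gain(docs, labels, "text")  # term appears in only one document
vectors = tfidf_vectors(docs, ["svm", "ball"])
```

In this toy corpus "svm" occurs in every "tech" document and in no "sport" document, so it receives a higher information gain than "text"; keeping the highest-scoring terms as the vocabulary and weighting them with TF-IDF yields the vectors a classifier such as an SVM would be trained on.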
Keywords/Search Tags:Text Categorization, Support Vector Machine, Feature Extraction, Vector Space Model