Modeling And Implementation Of Chinese Text Categorization System Based On SVM

Posted on:2007-02-01

Degree:Master

Type:Thesis

Country:China

Candidate:H W Ma

Full Text:PDF

GTID:2178360182460620

Subject:Management Science and Engineering

Abstract/Summary:

With the rapid development of communications and the Internet, the volume of available information is growing exponentially, and text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization (TC) has become very important, and TC based on support vector machines (SVM) is an active research field. Designing a TC platform based on SVM is therefore helpful for advancing research on Chinese text categorization.

In this thesis, a Chinese text categorization system has been implemented on Microsoft Windows 2000 with Visual C++ 6.0 and MS SQL Server 2000. Tests on two real examples verify the good performance of the system.

To loosen the coupling among the different modules of the system, a database is introduced, on which many statistics and calculations are performed efficiently. Furthermore, the algorithms in each module can be swapped, so experiments can be run to compare different algorithms.

The modeling and implementation of the system comprise four main parts: text feature extraction, vector space modeling, SVM machine learning, and SVM-based multi-class categorization. In the first part, four methods are compared: document frequency (DF), the χ² test (CHI), mutual information (MI), and information gain (IG). In the tests, the IG-based method performs best in this system. In the second part, the TFIDF algorithm is implemented. In the SVM learning part, Matlab is used to solve the optimization problem, and the one-vs-rest method is used for multi-class categorization, which has been shown to achieve good precision and recall.

For training, about 2000 Chinese texts in 10 classes were collected, and nearly 1000 other texts were used to test the classifier. The training and categorization tests show good results for this system. The precision and recall are about 97.84% and 89.93%, respectively, which is superior to some traditional text classifiers such as Rocchio and KNN.
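
As an illustration of the first two parts (IG-based feature selection and TFIDF weighting), the following is a minimal Python sketch, not the thesis's Visual C++ implementation. It assumes documents are already segmented into word tokens, treats a term as a binary present/absent feature when computing information gain, and uses the common weighting tf × ln(N/df); the abstract does not specify which TFIDF variant was used. The function names and the toy corpus are hypothetical, with English tokens standing in for segmented Chinese words.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], using term presence."""
    n = len(docs)
    all_classes = Counter(labels)
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n_t = sum(with_t.values())
    n_not = n - n_t
    h_c = entropy(all_classes.values(), n)
    h_cond = (n_t / n) * entropy(with_t.values(), n_t) \
        + (n_not / n) * entropy(without_t.values(), n_not)
    return h_c - h_cond


def tfidf_vector(doc, docs, vocabulary):
    """Map one tokenized document to a vector of tf * ln(N / df) weights."""
    n = len(docs)
    tf = Counter(doc)
    vector = []
    for term in vocabulary:
        df = sum(1 for d in docs if term in d)  # number of documents containing term
        idf = math.log(n / df) if df else 0.0
        vector.append(tf[term] * idf)
    return vector


# Toy corpus (hypothetical): two classes, two documents each.
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund"],
    ["football", "match"],
    ["football", "league", "match"],
]
labels = ["finance", "finance", "sports", "sports"]

# Rank all terms by information gain and keep the top two as features.
terms = sorted({t for d in docs for t in d})
ranked = sorted(terms, key=lambda t: information_gain(docs, labels, t), reverse=True)
vocabulary = ranked[:2]
vectors = [tfidf_vector(d, docs, vocabulary) for d in docs]
```

In this toy corpus, a term that appears in every document of one class and in no document of the other (such as "stock") receives the maximum gain of 1 bit, while a term that appears in only one document (such as "market") scores lower. The resulting vectors would then be passed to an SVM trainer, with one binary classifier per class under the one-vs-rest scheme.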

Keywords/Search Tags:

Text Categorization, Support Vector Machine, Feature Extraction, Vector Space Model

Related items

1	The Research And Implementation Of Chinese Text Categorization
2	Research On Support Vector Machines Classification Algorithm In Text Categorization
3	Design And Implementation Of The Technical Text Categorization System
4	Study On Text Categorization Method Based On Support Vector Machine
5	Study On Text Category Oriented Chinese Text Mining And Its Implementation
6	Research And Implementation Of Chinese Text Categorization System Based On Semantic Similarity
7	Research On Chinese Text Categorization Based On Support Vector Machine
8	Research Of Text Categorization Based On Vector Space Model
9	The Research On Text Categorization Algorithm Based On Support Vector Machine
10	Text Classification Technology And Applied Research