Research On Text Classification System Based On Support Vector Machine

Posted on:2007-12-14

Degree:Master

Type:Thesis

Country:China

Candidate:Z B Ma

Full Text:PDF

GTID:2178360182980261

Subject:Computer software and theory

Abstract/Summary:

Information technology has developed rapidly, and many people now read and learn from large amounts of Chinese-language information in daily life, especially on the Internet, where millions of Chinese web pages exist. How to obtain useful information quickly and efficiently by computer has become a research hotspot in the field of Chinese information processing. Chinese text classification has a great influence on Chinese information processing and has been applied in many fields, such as text identification, e-governance, search engines, and information filtering.

Support Vector Machine (SVM) is a new pattern recognition method developed in recent years on the basis of statistical learning theory. It was first proposed by Boser, Guyon, and Vapnik at COLT-92 and has been applied successfully in text classification, image recognition, and biological information processing. Compared with traditional classification methods, SVM shows many attractive features and strong performance in small-sample, nonlinear, and high-dimensional pattern recognition. SVM follows the principle of structural risk minimization and yields a globally optimal solution. A classifier based on SVM can have good generalization ability and achieve high accuracy even with a small sample.

Text classification refers to judging the category of a new text according to given definitions of the categories. Unlike English, automatic Chinese text classification requires word segmentation. In this paper, Chinese word segmentation is introduced first, and then an algorithm named two-way matching term is designed, which effectively reduces the ambiguity of Chinese words. Feature selection is an important step in text classification; after analyzing some traditional feature selection algorithms, we propose improvement strategies for the mutual information and chi-square statistic algorithms.

The design of the classifier is the core of a text classification system. Current methods of classifier design are discussed, with particular attention to support vector machines, including linear and nonlinear SVM, and classification results with different kernel functions are compared. Through an analysis of the training process for general classification, the selection of the training data set for text classification is discussed, and an algorithm named dynamic training data set is presented, which enhances the role of the training data set in the training and learning process of text classification.

Finally, combining text classification with the support vector machine method, a text classification system is designed and implemented. We use common indicators, such as precision, recall, and F value, to judge the results of the text classification system. Experimental results show that the overall average of the system's indicators is high and that the system classifies well.
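
The evaluation indicators named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `per_category_scores` and `macro_average` are hypothetical, the F value is assumed to be the balanced F1 (the harmonic mean of precision and recall), and the "overall average" is assumed to be an unweighted (macro) average over categories.

```python
from collections import Counter


def per_category_scores(y_true, y_pred):
    """Return {category: (precision, recall, f1)} for single-label predictions.

    Assumption: the F value is the balanced F1 score; the thesis abstract
    does not state which variant it uses.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # document wrongly assigned to category `guess`
            fn[truth] += 1  # document missing from its true category
    scores = {}
    for cat in set(y_true) | set(y_pred):
        p_den = tp[cat] + fp[cat]
        r_den = tp[cat] + fn[cat]
        precision = tp[cat] / p_den if p_den else 0.0
        recall = tp[cat] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[cat] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all categories."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = ["sports", "sports", "finance", "finance"]
    y_pred = ["sports", "finance", "finance", "finance"]
    scores = per_category_scores(y_true, y_pred)
    print(scores)
    print(macro_average(scores))
```

Macro averaging gives every category equal weight regardless of how many documents it contains; a micro average over pooled counts would be an equally plausible reading of "overall average".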

Keywords/Search Tags:

Text Classification, Chinese Word Segmentation, Feature Selection, Support Vector Machine

Related items

1	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
2	Design And Implementation Of Web Automatic Text Categorization
3	The Studies On Chinese Text Categorization Based On PSO And SVM
4	The Study And Implementation Of Chinese Web Text Classification
5	Automatic Classification Research On Chinese Web Document Orientation
6	Research On Chinese Text Classification System Based On Support Vector Machine
7	Chinese Text Classification Algorithm
8	Chinese Text Data Classification
9	Research On Word Segmentation And Feature Selection Of Chinese Text Classification
10	Chinese Text Classification Based On SVM Algorithm Realization