Research On Text Classification System Based On Support Vector Machine

Posted on: 2007-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Ma
GTID: 2178360182980261
Subject: Computer software and theory
Abstract/Summary:
Information technology has developed rapidly, and many people now read and learn from large amounts of Chinese-language information in daily life, especially on the Internet, where millions of Chinese web pages exist. How to obtain useful information quickly and efficiently by computer has become a research hotspot in the field of Chinese information processing. Chinese text classification has a great influence on Chinese information processing and has been applied in many fields, such as text identification, e-governance, search engines, and information filtering.

Support Vector Machine (SVM) is a relatively new pattern recognition method, developed in recent years on the basis of statistical learning theory. It was first proposed by Boser, Guyon, and Vapnik at COLT-92 and has been applied successfully in text classification, image recognition, and biological information processing. Compared with traditional classification methods, SVM shows many attractive features and strong performance on small-sample, nonlinear, and high-dimensional pattern recognition problems. SVM follows the principle of structural risk minimization and yields a globally optimal solution. A classifier based on SVM can generalize well and achieve a high accuracy rate even with a small sample.

Text classification refers to judging the category of a new text according to the given definitions of the categories. Unlike English, automatic Chinese text classification requires word segmentation. In this paper, Chinese word segmentation is introduced first, and then an algorithm named two-way matching term is designed, which effectively reduces the ambiguity of Chinese words. Feature selection is an important step in text classification; after analyzing several traditional feature selection algorithms, we propose improvement strategies for the mutual information and chi-square statistic algorithms.

The design of the classifier is the core of a text classification system.
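To make the chi-square statistic named in the feature selection discussion concrete, the following is a minimal sketch of the standard, unmodified statistic computed from a 2x2 term/category contingency table. It does not reproduce the improvement strategies proposed in the thesis, which the abstract does not specify. The function names `chi_square` and `top_terms`, and the example counts, are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def top_terms(tables, k):
    """Return the k terms with the highest chi-square score.

    tables maps each term to its (a, b, c, d) counts (hypothetical input shape).
    """
    scores = {term: chi_square(*cells) for term, cells in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical counts: "goal" appears only inside the category,
# while "the" is spread evenly and carries no category information.
example = {"goal": (10, 0, 0, 10), "the": (5, 5, 5, 5)}
best = top_terms(example, 1)
```

A term that is distributed independently of the category scores zero, so ranking by this score keeps the terms most associated with a category.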
Current methods of classifier design are discussed, with particular attention to support vector machines, including linear and nonlinear SVM, and classification results obtained with different kernel functions are compared. Through an analysis of the training process for general classification, the selection of the training data set for text classification is discussed, and an algorithm named dynamic training data set is presented, which enhances the role of the training data set during training and learning.

Finally, by combining text classification with the support vector machine method, a text classification system is designed and implemented. We use common indicators, such as precision, recall, and F value, to evaluate the results of the text classification system. Experimental results show that the overall averages of the system's indicators are high and that the system classifies well.
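The evaluation indicators mentioned above can be sketched as follows for a single category. This is an illustration only: it assumes the F value is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state, and the function name and sample labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Compute precision, recall and F1 for one category.

    Assumes the F value is F1, the harmonic mean of precision and recall.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == category and truth == category:
            tp += 1  # correctly assigned to the category
        elif predicted == category:
            fp += 1  # assigned to the category but belongs elsewhere
        elif truth == category:
            fn += 1  # belongs to the category but was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for four documents.
truth = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
p, r, f = precision_recall_f1(truth, predicted, "sports")
```

Averaging these per-category values over all categories gives one overall figure per indicator, which is one common way to obtain a system-wide average.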
Keywords/Search Tags: Text Classification, Chinese Word Segmentation, Feature Selection, Support Vector Machine