Research Of Text Categorization System Based On SVM

Posted on: 2009-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
Full Text: PDF
GTID: 2178360248953618
Subject: Measuring and Testing Technology and Instruments
Abstract/Summary:
The growth of the Internet has produced a vast amount of electronic text, and obtaining useful information from it quickly and efficiently by computer has become an active research topic. Automatic text classification addresses this need. It is an important part of information processing and is used in text identification, e-government, search engines, and information filtering, so improving its accuracy matters for these applications.

This thesis implements a text classification system based on the Support Vector Machine (SVM). Compared with traditional classification methods, SVM has many attractive properties and performs notably well on small-sample, nonlinear, and high-dimensional pattern recognition problems. SVM follows the principle of structural risk minimization and yields a globally optimal solution. Working from limited sample information, it seeks the best trade-off between model complexity and learning ability, so it generalizes well and effectively mitigates overfitting. An SVM-based classifier can therefore achieve good generalization and high accuracy even with few samples.

This thesis introduces the basic process of Chinese text classification and its main techniques, such as text representation and feature selection. It focuses on the SVM classification algorithm, analyzes the factors that influence its results, and compares the classification results obtained with different kernel functions. We implement an SVM-based text categorization system whose classifier supports multi-class classification. In text preprocessing, we use the ICTCLAS system for word segmentation and combine Document Frequency (DF) with Information Gain (MI) for feature selection, which avoids the disadvantages of using DF or MI alone. Unlike the usual approach, we use grid search to optimize the kernel function parameters. Finally, experiments show that the improved system achieves better results and higher accuracy.
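
The abstract does not say how DF and the second criterion are combined, so the following is only a minimal sketch under stated assumptions: terms are first filtered by a document-frequency threshold, and the survivors are then ranked by information gain. The function names, the `min_df` and `top_k` parameters, and the toy English corpus are all hypothetical; in the thesis the tokens would come from ICTCLAS word segmentation of Chinese text.

```python
import math
from collections import Counter

def document_frequency(doc_sets):
    """Count the number of documents in which each term occurs."""
    df = Counter()
    for terms in doc_sets:
        df.update(terms)
    return df

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(term, doc_sets, labels):
    """Drop in class entropy from knowing whether `term` is present."""
    classes = sorted(set(labels))
    present = Counter(l for d, l in zip(doc_sets, labels) if term in d)
    absent = Counter(l for d, l in zip(doc_sets, labels) if term not in d)
    n = len(doc_sets)
    n_present = sum(present.values())
    n_absent = n - n_present
    h_all = entropy([labels.count(c) for c in classes])
    h_present = entropy([present[c] for c in classes])
    h_absent = entropy([absent[c] for c in classes])
    return h_all - (n_present / n) * h_present - (n_absent / n) * h_absent

def select_features(docs, labels, min_df=2, top_k=3):
    """Assumed combination: DF filter first, then rank by information gain."""
    doc_sets = [set(d) for d in docs]
    df = document_frequency(doc_sets)
    candidates = [t for t, count in df.items() if count >= min_df]
    # Ties in information gain are broken alphabetically for determinism.
    ranked = sorted(
        candidates,
        key=lambda t: (-information_gain(t, doc_sets, labels), t),
    )
    return ranked[:top_k]

# Toy, already-segmented corpus (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["team", "match", "score"],
    ["stock", "market", "team"],
    ["stock", "market", "price"],
]
labels = ["sports", "sports", "finance", "finance"]

print(select_features(docs, labels, min_df=2, top_k=3))
```

In this toy corpus the DF threshold discards terms that occur in only one document, and the information-gain ranking then places the term shared across both classes ("team") below the class-specific ones. The selected terms would form the vocabulary for the document vectors passed to the SVM.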
Keywords/Search Tags: Text Classification, Support Vector Machine, Feature Selection, Grid Search