
Application And Research Of Feature Selection Method In Chinese Text Categorization

Posted on: 2012-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: G D Hu
Full Text: PDF
GTID: 2178330332990753
Subject: Computer application technology
Abstract/Summary:
Text categorization helps extract useful information from large volumes of text, and it has been widely studied and applied. It compares an unknown text against a set of predefined classes; if the text matches one or more of them, it is assigned to those classes. Feature selection is an important factor in categorization performance: it picks the most representative features from the original feature set and so reduces the feature dimension. For text, feature selection is necessary to make the categorization task more efficient and accurate.
Through analysis and comparison, this paper points out the shortcomings of the traditional Chi-Square (CHI) statistic. First, it considers how often a feature term occurs across all texts rather than within a single text, which lowers the final categorization accuracy. Second, it attends only to feature terms that occur many times; when feature terms that occur only a few times in a class are also selected, categorization becomes inaccurate. To address these shortcomings, this paper improves the CHI statistic by incorporating frequency information and validates the improvement with the K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms. In addition, this paper proposes a new feature selection method that has not yet been used in a Chinese categorization system and is a direction for future study.
Finally, this paper builds a Chinese categorization system whose modules are independent, so that one module can be modified without affecting the others. The experiments use the Open Text Categorization System provided by Li Ronglu of Fudan University, with the KNN and SVM classifiers chosen to test the improved CHI methods.
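To make the first shortcoming concrete, here is a minimal Python sketch of the CHI statistic in its commonly cited textbook form, N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), computed from document counts. The abstract does not give the formula used in the thesis or the details of its improved variant, so the function name `chi_square`, the toy corpus, and the assumption that documents are already segmented into tokens are illustrative only.

```python
def chi_square(docs, labels, term, target):
    """Textbook CHI score of `term` for class `target`, from document counts.

    a: documents in `target` that contain `term`
    b: documents outside `target` that contain `term`
    c: documents in `target` that lack `term`
    d: documents outside `target` that lack `term`
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens      # presence only: repeats inside a document are ignored
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus (already tokenized); not data from the thesis.
docs = [
    ["ball", "match", "ball"],
    ["ball", "goal"],
    ["stock", "market"],
    ["market", "match"],
]
labels = ["sports", "sports", "finance", "finance"]

score_ball = chi_square(docs, labels, "ball", "sports")
score_match = chi_square(docs, labels, "match", "sports")

# Dropping the repeated "ball" from the first document leaves the score unchanged,
# because only presence or absence per document is counted.
docs_no_repeat = [["ball", "match"]] + docs[1:]
score_ball_no_repeat = chi_square(docs_no_repeat, labels, "ball", "sports")
```

In this sketch a term that occurs twice in a document contributes exactly as much as one that occurs once, which is the within-text frequency information that the abstract says the traditional statistic ignores.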
To keep the experiments comparable, the corpus, classifier, and parameters are held the same throughout. The results are analyzed and compared in terms of the number of correctly classified documents, histograms, recall, precision, and confusion matrices. They show that the improved CHI methods perform better than the traditional CHI method, which indicates that the improved methods are feasible and effective.
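The evaluation measures named above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the thesis's evaluation code: the helper names `confusion_matrix` and `precision_recall` and the label lists are hypothetical, and rows are taken to be true classes and columns predicted classes.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count documents by (true class, predicted class); rows are true classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def precision_recall(matrix, k):
    """Precision and recall for the class at row/column k."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)   # column sum
    actually_k = sum(matrix[k])                      # row sum
    precision = true_positive / predicted_as_k if predicted_as_k else 0.0
    recall = true_positive / actually_k if actually_k else 0.0
    return precision, recall


# Hypothetical labels for five documents; not results from the thesis.
classes = ["sports", "finance"]
true_labels = ["sports", "sports", "sports", "finance", "finance"]
predicted = ["sports", "sports", "finance", "finance", "sports"]

matrix = confusion_matrix(true_labels, predicted, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))  # correctly classified documents
finance_precision, finance_recall = precision_recall(matrix, classes.index("finance"))
```

The diagonal of the matrix gives the number of correctly classified documents, and each class's precision and recall come from its column and row sums respectively.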
Keywords/Search Tags: text categorization, feature selection, CHI statistic, KNN, SVM