The Research And Implementation Of Chinese Text Classification System

Posted on: 2008-01-01 | Degree: Master | Type: Thesis
Country: China | Candidate: X L Wang | Full Text: PDF
GTID: 2178360245493117 | Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the volume of text is growing explosively. Finding the required information quickly and accurately in this mass of data has become an important problem in the field of information processing. Text classification, the automated assignment of natural-language texts to predefined categories based on their content, is one way to address this problem.

This thesis discusses the key techniques of Chinese text classification: Chinese word segmentation, feature selection, feature weighting, and classification methods. Widely used feature selection methods, such as mutual information, information gain, and the chi-square statistic, rest on different criteria, so they may score the same feature very differently. To overcome the shortcomings of any single method, this thesis considers combinations of two or more feature selection methods. Experimental results show that a combination of two methods performs better than either a single method or a combination of three or more methods. Moreover, the combinations that perform well all include the chi-square statistic.

Traditional feature weighting methods (such as TF-IDF and entropy-based weighting) consider only the importance of a feature across the whole text collection; they cannot reflect how the importance of a feature differs from one text category to another. To address this, the thesis presents an improved weighting method that uses mutual information to reflect the correlation between features and text categories. The Rocchio and KNN classification methods are then implemented, and the experimental results show that the modified weighting method improves classification performance.
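To make the idea of combining feature selection methods concrete, the following is a minimal Python sketch. It scores each term against one category with the chi-square statistic and with pointwise mutual information, then merges the two rankings by averaging ranks. The abstract does not say how the thesis combines methods, so the rank-averaging rule, the function names, and the example counts are all assumptions made for illustration only.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (a * d - c * b) ** 2 / denom

def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + b) * (a + c)))

def ranks(scores):
    """Map each term to its rank under the given scores (0 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {term: i for i, term in enumerate(ordered)}

def combined_selection(tables, k):
    """Pick the k terms with the best average rank across both methods.

    Rank averaging is an assumed combination rule, not the thesis's.
    Ties are broken alphabetically to keep the result deterministic.
    """
    r_chi = ranks({t: chi_square(*tab) for t, tab in tables.items()})
    r_mi = ranks({t: mutual_information(*tab) for t, tab in tables.items()})
    avg = {t: (r_chi[t] + r_mi[t]) / 2 for t in tables}
    return sorted(tables, key=lambda t: (avg[t], t))[:k]

# Hypothetical (a, b, c, d) counts for a "sports" category
# of 50 documents in a 1000-document collection.
example_tables = {
    "football": (40, 5, 10, 945),
    "stock": (1, 99, 49, 851),
    "the": (50, 950, 0, 0),
}

if __name__ == "__main__":
    print(combined_selection(example_tables, 2))
```

Averaging ranks rather than raw scores sidesteps the fact that the two statistics live on different scales, which is one plausible way to reconcile methods that "score the same feature very differently".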
Keywords/Search Tags: text classification, Chinese segmentation, feature weight, feature selection