
A χ² Statistics-Based Feature Selection Method for Chinese Text Categorization

Posted on: 2010-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Xiao
Full Text: PDF
GTID: 2208360275952215
Subject: Computer software and theory
Abstract/Summary:
With the growth of the World Wide Web, the number of documents on the Internet is increasing rapidly. For users, finding the information they need within this massive volume is like looking for a needle in a haystack. How can useful information be extracted from such a large and complex body of text? Text categorization is one of the most important approaches. Feature selection methods and classification algorithms are the main research directions in text categorization.
Feature selection is an important part of text categorization and directly affects classification precision. Building on a comprehensive analysis of feature selection methods for text classification, this paper focuses on the χ² statistic feature selection method. The traditional χ² method has two limitations: 1) It considers only the document frequency of a feature across all texts and ignores the frequency of the feature within a single text, so it is unreliable for features with low document frequency. A feature term that appears frequently in only a few documents of a category, such as a specialist term, may contribute greatly to categorization and represent the category well, but the traditional χ² approach does not take this case into account. 2) A feature term may appear frequently in other classes but not in the specified class. Such a term cannot represent the specified class, yet the traditional χ² approach does not take this case into account either.
To overcome these shortcomings, this paper improves the χ² approach by incorporating criteria from traditional statistical methods, such as document frequency and class accuracy.
A feature term that appears frequently in one category is a good representative of that category, so frequency is taken into account. A helpful feature term should appear mostly in one category rather than in all categories, so concentration among categories is taken into account. A feature term that is evenly distributed among the documents of a category is helpful to that category, so distribution within categories is taken into account.
The other contribution of this paper is a Chinese text categorization system consisting of three parts: word segmentation, feature selection and text categorization. The parts are independent but share a consistent interface, so each part can easily use the others, and a change to one part is transparent to the rest. This makes it convenient to improve one part without affecting the others.
A comparative experiment was conducted to verify the effectiveness of the improved χ² approach. The results show that the improved χ² approach is superior to the traditional χ² approach in feature selection, which verifies the effectiveness and feasibility of the improved χ² approach.
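For reference, the two limitations of the traditional χ² statistic described in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used 2×2 contingency-table form of the statistic, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)); the abstract does not state the formula, and the function names, the toy documents and the labels below are hypothetical. The improved method of the thesis is not reproduced here, since the abstract does not give its formula.

```python
def contingency_counts(docs, labels, term, category):
    """Count documents for the 2x2 table of (term present?, in category?).

    docs   : list of token lists (hypothetical, already word-segmented)
    labels : list of category names, one per document
    Returns (a, b, c, d):
      a = docs in the category that contain the term
      b = docs outside the category that contain the term
      c = docs in the category that do not contain the term
      d = docs outside the category that do not contain the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens      # presence only; repeats are ignored
        in_cat = label == category
        if present and in_cat:
            a += 1
        elif present:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Traditional chi-square score from document counts (assumed form)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category covers all or no docs
    return n * (a * d - c * b) ** 2 / denom


# Limitation 1: within-document frequency is ignored.
labels = ["sports", "sports", "finance", "finance"]
docs_repeated = [["goal", "goal", "goal"], ["match"], ["stock"], ["market"]]
docs_single = [["goal"], ["match"], ["stock"], ["market"]]
score_repeated = chi_square(*contingency_counts(docs_repeated, labels, "goal", "sports"))
score_single = chi_square(*contingency_counts(docs_single, labels, "goal", "sports"))
print(score_repeated == score_single)  # True: repeating the term changes nothing

# Limitation 2: the squared difference hides the direction of association.
only_in_category = chi_square(a=5, b=0, c=0, d=5)
only_in_other_classes = chi_square(a=0, b=5, c=5, d=0)
print(only_in_category, only_in_other_classes)  # 10.0 10.0
```

Because the statistic is built from document counts alone, a term repeated many times in one document scores the same as a term that occurs once, and because (AD − CB) is squared, a term found only outside the category scores as high as one found only inside it. These are the two cases the abstract says the traditional approach does not take into account.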
Keywords/Search Tags: Text categorization, Feature selection, χ² approach, Chinese text categorization