An Improved Approach To CHI In Feature Selection Of Chinese Text Categorization

Posted on: 2009-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: P Z Zhang
GTID: 2178360272475561
Subject: Computer system architecture
Abstract/Summary:
Extracting valuable information from large volumes of heterogeneous text is difficult, and text categorization is one way to address the problem. Within text categorization, feature selection and categorization algorithms are the two key research directions. The goal of feature selection is to choose the most representative features so that the dimensionality of the text space is reduced. This improves categorization efficiency and, by filtering out noise feature terms, also improves categorization precision. Categorization algorithms, in turn, are a powerful means of improving categorization performance.

Feature selection is an essential part of text categorization and directly affects categorization precision. Building on a comprehensive study of feature selection approaches for text categorization, this thesis focuses on the traditional CHI approach and finds that it has two limitations:

1) It considers only the document frequency of a feature across all texts and ignores the frequency of the feature within a single text, so it is unreliable for features with low document frequency. A feature term that appears frequently in only a few documents of a category, such as a technical term, may contribute the most to categorization and represent the category well, but the traditional CHI approach does not account for this case.

2) It compares the contribution of a feature term to one category against its contribution to the other categories, so it may select feature terms that contribute more to the other categories. Such terms typically have low frequency in the category and are widespread in the other categories, so they cannot represent the category.

To address these shortcomings, this thesis proposes an improved CHI approach that combines three criteria: frequency, concentration among categories, and distribution within categories. Feature terms that appear frequently in a category represent its characteristics well, so frequency is taken into account. A useful feature term should appear mostly in one category rather than in all categories, so concentration among categories is taken into account. A feature term that is evenly distributed among the documents of a category is useful for that category, so distribution within categories is taken into account.

The other contribution of this thesis is a Chinese text categorization system consisting of three parts: word segmentation, feature selection, and text categorization. The parts are independent but share a consistent interface, so each part can easily use the others and a change to one part is transparent to the rest. This makes it convenient to improve one part without affecting the others.

To verify the effectiveness of the improved CHI approach, a comparative experiment was carried out. The results show that the improved CHI approach is superior to the traditional CHI approach in feature selection, which confirms its effectiveness and feasibility.
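
The abstract does not reproduce the formula for the traditional CHI approach. The sketch below assumes the standard chi-square statistic over a 2x2 table of document counts, which is the usual definition in the feature selection literature, and uses it to illustrate the two limitations described in the abstract. The function names `contingency` and `chi_square` and the toy corpora are hypothetical and are not taken from the thesis. The improved approach is not sketched, because its formula is not stated in the abstract.

```python
def contingency(docs, labels, term, category):
    """Build the 2x2 document-count table for one term and one category.

    docs   : list of tokenized documents (each a list of words)
    labels : category label of each document
    Returns (a, b, c, d):
      a = documents in the category that contain the term
      b = documents outside the category that contain the term
      c = documents in the category that lack the term
      d = documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens  # presence only: repeats inside a document are ignored
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard CHI statistic (assumed form, not quoted from the thesis):

    chi2 = N * (a*d - c*b)^2 / ((a+c) * (b+d) * (a+b) * (c+d)),  N = a+b+c+d
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # the term or the category is degenerate in this corpus
    return n * (a * d - c * b) ** 2 / denominator


# Limitation 1: within-document frequency is invisible to the statistic.
labels = ["cs", "cs", "sport", "sport"]
docs_once = [["kernel"], ["x"], ["y"], ["z"]]
docs_many = [["kernel"] * 50, ["x"], ["y"], ["z"]]
score_once = chi_square(*contingency(docs_once, labels, "kernel", "cs"))
score_many = chi_square(*contingency(docs_many, labels, "kernel", "cs"))

# Limitation 2: the squared term hides the direction of the association.
# A term in 95 of 100 category documents and 5 of 100 other documents ...
score_typical = chi_square(95, 5, 5, 95)
# ... scores the same as a term in 5 category documents and 95 other documents.
score_atypical = chi_square(5, 95, 95, 5)  # both evaluate to 162.0
```

In this sketch `score_once` equals `score_many` because only term presence is counted, and `score_typical` equals `score_atypical` because `(a*d - c*b)` is squared. These are the two behaviors that the frequency and concentration criteria of the improved approach are meant to correct.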
Keywords/Search Tags: Text categorization, Feature selection, CHI approach, Chinese text categorization system