
The Study On Feature Selection Methods For Automatic Text Categorization

Posted on: 2011-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Liu
Full Text: PDF
GTID: 2348330503471939
Subject: Computer application technology
Abstract/Summary:
Automatic text categorization is a significant research topic in the field of information retrieval. Faced with large volumes of high-dimensional document data, improving the precision and efficiency of text categorization has become a pressing problem that cannot be ignored. As a dimensionality reduction technique, feature selection filters out irrelevant or marginal features from the feature space of the document data and selects a feature subset that adequately represents the document content, thereby improving both the precision and the efficiency of text categorization. Existing feature selection methods, however, suffer from notable defects: their theoretical foundations are diverse and unsystematic, and their applicability is limited by the complex content and varied class distributions of document data.

When the classes in a document data set are imbalanced, different feature selection methods exhibit different categorization performance. By combining the relationship between the evaluation functions of Information Gain (IG) and Expected Cross Entropy (ECE) with their behavior under different class distributions, we propose a weighted IG-based feature selection method (WIG). Built on the IG evaluation function, WIG adaptively adjusts the weights assigned to the occurrence and non-occurrence of a feature according to the class distribution. Compared with IG and ECE, WIG therefore offers better adaptability and stability.

In a text collection, the degree of independence between a feature and each document class reflects how well the feature represents the text content. From this observation we infer that theories of statistical independence can be applied to feature selection for text categorization. We therefore propose three feature selection approaches based on independence hypothesis tests: Distributed Homogeneous Chi-square (DHChi2), Likelihood Ratio (LR) and Wald. We also propose an event independence-based approach (EIBA) and its improved version (IEIBA). Applying independence theories to feature selection not only gives feature selection a unified theoretical foundation that makes the procedure more systematic, but also helps to improve text categorization performance.
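For context, the IG evaluation function commonly used in text categorization scores a term $t$ over classes $c_1,\dots,c_m$ as in the first equation below; this is the standard formulation, not reproduced from the thesis. The second equation is only an illustrative sketch of the weighting idea behind WIG: the coefficients $\alpha$ and $\beta$ are placeholders for the adaptive, class-distribution-dependent weights on the occurrence and non-occurrence terms, whose exact definition is given in the full text.

\[
IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
      + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
      + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
\]

\[
WIG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
       + \alpha\, P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
       + \beta\, P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
\]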
Keywords/Search Tags: Feature Selection, Weighted Information Gain, Independence Hypothesis Test, Event Independence