Research Of Feature Selection For Text Classification

Posted on: 2012-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: L L Liu
Full Text: PDF
GTID: 2178330338993792
Subject: Computer Science and Technology
Abstract/Summary:
Feature selection (FS) is one of the most important issues in text classification. Good feature selection removes irrelevant and redundant features from the original feature set and can thereby improve both the efficiency and the accuracy of a text classifier.

Relying on the assumption of conditional independence, traditional FS methods focus only on removing irrelevant features by means of an evaluation measure. However, correlation between features is ubiquitous; that is, the relevant feature subset still contains many redundant features.

This paper covers both relevance analysis and redundancy analysis. A feature's distributional information objectively reflects the correlation between features and texts and the correlation between features and classes. The paper therefore starts by defining the concept of a feature's distributional information. It first measures the feature's distributional factors quantitatively, and then establishes the relationships among a feature's distributional information, feature-class relevance, and feature-feature redundancy.

This paper presents a feature relevance evaluation method, DIFS, which is based on the difference in a feature's distribution among all classes. It also presents several feature redundancy evaluation methods, which are based on the similarity of features' distributions among all classes or all texts. In addition, the paper designs two algorithms to implement DIFS and gives three redundancy measures, referred to as SIM. To evaluate both 'relevance analysis' alone and 'relevance analysis + redundancy analysis', experiments are carried out on a Chinese corpus; in comparison, the proposed approaches show better performance.
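
The two-stage idea described in the abstract (rank features by relevance, then discard features that are redundant with those already kept) can be illustrated with a minimal Python sketch. The abstract does not give the DIFS or SIM formulas, so everything below is an assumption made for illustration: relevance is taken to be the spread (maximum minus minimum) of a term's per-class document rate, redundancy is taken to be the cosine similarity between two terms' per-class distribution vectors, and the names `class_distribution`, `relevance`, `cosine`, `select_features`, and `max_sim` are hypothetical.

```python
import math
from collections import defaultdict


def class_distribution(docs, labels):
    """Map each term to its per-class document rate.

    For every class (in sorted order) the value is the fraction of that
    class's documents that contain the term at least once.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    class_sizes = [0] * len(classes)
    counts = defaultdict(lambda: [0] * len(classes))
    for tokens, label in zip(docs, labels):
        i = index[label]
        class_sizes[i] += 1
        for term in set(tokens):  # count each term once per document
            counts[term][i] += 1
    return {t: [n / s for n, s in zip(v, class_sizes)] for t, v in counts.items()}


def relevance(dist):
    """Assumed relevance score (not the DIFS formula): spread across classes."""
    return max(dist) - min(dist)


def cosine(u, v):
    """Assumed redundancy score (not a SIM formula): cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_features(docs, labels, k, max_sim=0.95):
    """Greedy selection: most relevant first, skipping near-duplicates."""
    dists = class_distribution(docs, labels)
    # Sort by descending relevance; break ties alphabetically for determinism.
    ranked = sorted(dists, key=lambda t: (-relevance(dists[t]), t))
    selected = []
    for term in ranked:
        if all(cosine(dists[term], dists[s]) < max_sim for s in selected):
            selected.append(term)
        if len(selected) == k:
            break
    return selected
```

With a per-class vector as the only distributional information, two terms whose class rates are proportional count as fully redundant, which is coarse; the abstract also mentions similarity measured over all texts, which a per-document vector would capture more finely.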
Keywords/Search Tags: Feature Selection, Text Classification, Distributional Information, Feature Relevance, Feature Redundancy