
Research on Feature Selection Algorithms Based on Information Theory

Posted on: 2013-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Liu
Full Text: PDF
GTID: 2248330395950881
Subject: Computer application technology
Abstract/Summary:
As databases in data mining and machine learning applications grow toward large scale and high dimensionality, traditional classification algorithms face great challenges. In gene expression data analysis, for example, the bioinformatics databases being analyzed usually contain a huge number of features but only a small number of samples. Redundant or irrelevant features in such databases not only slow the learning process, harming algorithm accuracy, knowledge discovery, and knowledge understanding, but also cause the curse of dimensionality. Feature selection under massive-data conditions is therefore particularly important. Feature selection removes irrelevant and redundant features from the original feature space according to an evaluation criterion, thereby reducing the dimensionality of the feature space; it has been widely applied in many fields, and feature selection based on information theory has become a hot research topic.

This study summarizes the background of feature selection and information theory, and analyzes current trends in information measurement as well as typical information measures. A new measure, the normalized variation of information (NVI), is proposed. We prove in detail that this expression is a metric distance, satisfying the metric conditions of symmetry, non-negativity, and the triangle inequality. Based on the new metric NVI, an improved feature selection algorithm, called IFCA, is proposed.
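The abstract does not spell out the NVI formula. A standard normalization of the variation of information, VI(X,Y) = H(X,Y) − I(X;Y), divides by the joint entropy H(X,Y), giving a distance in [0,1] that is 0 for identical variables and 1 for independent ones. A minimal sketch under that assumption (the function names here are illustrative, not from the thesis):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def joint_entropy(x, y):
    """Entropy of the joint distribution of two aligned sequences."""
    return entropy(list(zip(x, y)))

def nvi(x, y):
    """Normalized variation of information: VI(X,Y) / H(X,Y).

    Assumed normalization: 0 when x and y are identical,
    1 when they are statistically independent.
    """
    h_xy = joint_entropy(x, y)
    if h_xy == 0:
        return 0.0  # both variables are constant
    mi = entropy(x) + entropy(y) - h_xy  # mutual information I(X;Y)
    return 1.0 - mi / h_xy               # = (H(X,Y) - I(X;Y)) / H(X,Y)
```

Because VI is itself a metric and this normalization preserves symmetry, non-negativity, and the triangle inequality, the normalized form remains a distance, which is what the metric proof in the thesis establishes.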
The selection algorithm follows the basic idea of the K-means clustering algorithm: it groups highly relevant features into clusters and then, from each cluster, selects one feature highly correlated with the class, removing redundant and irrelevant features at the same time. The NVI metric can describe not only the correlation between a feature and the class but also the correlation between that feature and any other relevant feature; it can therefore serve as the distance metric of other algorithms and is not confined to the proposed feature selection algorithm. Experimental comparison with other state-of-the-art methods based on information-theoretic criteria shows that NVI yields a smaller feature subset with higher efficiency and better classification performance, and that the proposed algorithm IFCA achieves lower training and generalization error, making it applicable to high-dimensional databases. Simulation experiments on public datasets confirm the performance and effectiveness of IFCA. Nevertheless, several problems remain in the proposed algorithm, and our future work will address them to further improve its performance and efficiency.
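The abstract only outlines the cluster-then-select idea, so the following is a simplified greedy stand-in, not the thesis's exact IFCA procedure: features within an assumed NVI threshold of a cluster's first member are treated as mutually redundant, and each cluster contributes the single feature closest to the class. The `nvi` helper and the `threshold` default are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def nvi(x, y):
    """Normalized variation of information (assumed form): VI / H(X,Y)."""
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0:
        return 0.0
    mi = entropy(x) + entropy(y) - h_xy
    return 1.0 - mi / h_xy

def select_features(features, labels, threshold=0.9):
    """Cluster-then-select sketch: greedily group features whose NVI
    distance to a cluster representative is below `threshold` (i.e.
    mutually redundant), then keep from each cluster the feature with
    the smallest NVI distance to the class labels."""
    clusters = []  # each cluster is a list of feature indices
    for i, f in enumerate(features):
        for cluster in clusters:
            if nvi(f, features[cluster[0]]) < threshold:
                cluster.append(i)  # redundant with this cluster
                break
        else:
            clusters.append([i])   # starts a new cluster
    # one representative per cluster: the feature most relevant to the class
    return [min(c, key=lambda j: nvi(features[j], labels)) for c in clusters]
```

For example, if two features are exact duplicates, they land in one cluster and only one survives, while an independent feature forms its own cluster; this removes redundancy and irrelevance in a single pass, mirroring the behaviour the abstract claims for IFCA.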
Keywords/Search Tags: Data mining, Feature selection, Information theory, Mutual information, Learning algorithm, Clustering