
Research On Hybrid Feature Selection Algorithm Based On Mutual Information And Random Forest

Posted on: 2018-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhao
Full Text: PDF
GTID: 2348330542952396
Subject: Applied Mathematics
Abstract/Summary:
With the arrival of the information age, the data generated from all walks of life is overwhelming, and the emergence and growth of high-dimensional data have brought great challenges to data processing. On the one hand, high-dimensional data makes processing prone to the curse of dimensionality. On the other hand, redundant and irrelevant features in high-dimensional data interfere with the description and application of the data. Feature selection lays the foundation for subsequent data processing by reducing dimensionality and filtering noise.

Mutual information is a typical information-theoretic measure: it does not require the distribution of the original data to be known in advance, and the data after transformation preserves the invariance of information entropy. Random forests can effectively identify informative features and handle the relationship between features and classifiers well. This thesis studies a hybrid feature selection algorithm based on mutual information and random forests, combining the advantages of both.

The thesis first improves on the random forest feature selection algorithm proposed by Hapfelmeier et al., which performs feature selection within the theoretical framework of permutation tests. First, the data of each feature is permuted in turn; each permutation requires rebuilding the random forest and recalculating the importance values of all features, and multiple permutations yield an empirical distribution of feature importances. Then an appropriate probability distribution is fitted to this empirical distribution by the permutation importance algorithm, and statistical methods are used to evaluate the p-value of each feature from the fitted distribution. If no appropriate probability distribution is found, the p-value estimation of the original algorithm is used instead. Finally, features are selected according to their p-values. Comparative analyses against seven algorithms show that the improved algorithm has a certain advantage in classification accuracy, generalization ability, and running time.

The thesis then proposes a new hybrid algorithm that combines a mutual-information-based feature selection algorithm with the improved algorithm. The hybrid algorithm has two stages. In the first stage, features are found by a greedy search: mutual information is used to evaluate the relationship between features and the class variable, which quickly filters out redundant and irrelevant features and reduces the dimension of the sample space. In the second stage, the features selected in the first stage are passed to the improved algorithm, and the final feature subset is selected by the random forest within the permutation-test framework. Six real datasets are selected from the UCI repository, and the hybrid algorithm is compared with seven existing algorithms. The experimental results show that the hybrid algorithm achieves a certain improvement in classification accuracy and generalization ability.
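The permutation-test step described above can be sketched as follows. This is a minimal illustration in Python with scikit-learn (an assumed stack; the function name `permutation_pvalues`, the forest size, and the number of permutations are all illustrative choices, not the thesis's settings). For simplicity it estimates each p-value directly from the empirical null distribution, which corresponds to the fallback route the thesis describes when no probability distribution fits well:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def permutation_pvalues(X, y, n_perm=19, seed=0):
    """For each feature: permute its column, refit the forest, and record
    that feature's importance under the permutation. Repeating this builds
    an empirical null distribution; the p-value is the fraction of null
    importances at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    base = RandomForestClassifier(n_estimators=30, random_state=seed).fit(X, y)
    observed = base.feature_importances_
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        null = np.empty(n_perm)
        for b in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's link to y
            rf = RandomForestClassifier(n_estimators=30,
                                        random_state=seed).fit(Xp, y)
            null[b] = rf.feature_importances_[j]
        # add-one correction keeps the estimated p-value strictly positive
        pvals[j] = (1 + np.sum(null >= observed[j])) / (n_perm + 1)
    return pvals

# Synthetic data: 3 informative features among 6
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
pvals = permutation_pvalues(X, y)
selected = np.where(pvals <= 0.05)[0]  # keep features with small p-values
```

Note that each permutation refits the whole forest, so the cost grows with both the number of features and the number of permutations; this is exactly the expense that the mutual-information filtering stage below is meant to reduce.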
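The first stage of the hybrid algorithm can be sketched as follows, again assuming scikit-learn. This is a simplified stand-in: the thesis uses a greedy search over mutual information, while here a plain top-k ranking by `mutual_info_classif` illustrates the filtering idea, and `k = 8` is an arbitrary cut-off chosen for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 4 informative features among 20
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=1)

# Stage 1: score each feature by its mutual information with the class
# variable and keep the k highest-scoring features, quickly discarding
# irrelevant ones before the costlier random-forest stage.
k = 8
mi = mutual_info_classif(X, y, random_state=1)
top_k = np.argsort(mi)[::-1][:k]
X_reduced = X[:, top_k]
```

The reduced matrix `X_reduced` would then be handed to the permutation-test stage, which selects the final feature subset from this smaller candidate pool.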
Keywords/Search Tags: Feature Selection, Random Forest, Mutual Information, Hybrid Algorithm