
Research On Hybrid Feature Selection Algorithm Based On Mutual Information And Random Forest

Posted on: 2018-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhao
Full Text: PDF
GTID: 2348330542952396
Subject: Applied Mathematics
Abstract/Summary:
With the arrival of the information age, the data generated from all walks of life is overwhelming, and the emergence and growth of high-dimensional data have brought great challenges to data processing. On the one hand, high-dimensional data makes processing prone to the curse of dimensionality. On the other hand, redundant and irrelevant features in high-dimensional data interfere with the description and application of the data. Feature selection lays the foundation for subsequent data processing by reducing dimensionality and filtering noise.

Mutual information is a typical information-theoretic measure: it does not require the distribution of the original data to be known in advance, and the data after transformation preserves the invariance of information entropy. Random forests can effectively identify informative features and handle the relationship between features and classifiers well. This thesis studies a hybrid feature selection algorithm based on mutual information and random forests, combining the advantages of both.

The thesis first improves on the random forest feature selection algorithm proposed by Hapfelmeier et al., which performs feature selection within the theoretical framework of permutation tests. First, the data of each feature is permuted in turn; each permutation requires rebuilding the random forest and recalculating the importance values of all features, and multiple permutations yield an empirical distribution of feature importances. Then an appropriate probability distribution is fitted to this empirical distribution by the permutation importance algorithm, and statistical methods are used to evaluate the p-value of each feature from the fitted distribution. If no appropriate probability distribution is found, the p-value estimation of the original algorithm is used instead. Finally, features are selected according to their p-values. Comparative analyses against seven algorithms show that the improved algorithm has a certain advantage in classification accuracy, generalization ability, and running time.

The thesis then proposes a new hybrid algorithm that combines a mutual-information-based feature selection algorithm with the improved algorithm. The hybrid algorithm has two stages. In the first stage, features are found by a greedy search: mutual information is used to evaluate the relationship between features and the class variable, which quickly filters out redundant and irrelevant features and reduces the dimension of the sample space. In the second stage, the features selected in the first stage are passed to the improved algorithm, and the final feature subset is selected by the random forest within the permutation-test framework. Six real datasets are selected from the UCI repository, and the hybrid algorithm is compared with seven existing algorithms. The experimental results show that the hybrid algorithm achieves a certain improvement in classification accuracy and generalization ability.
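The permutation-test step described above can be sketched as follows. This is a minimal illustration in Python with scikit-learn (an assumed stack; the function name `permutation_pvalues`, the forest size, and the number of permutations are all illustrative choices, not the thesis's settings). For simplicity it estimates each p-value directly from the empirical null distribution, which corresponds to the fallback route the thesis describes when no probability distribution fits well:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def permutation_pvalues(X, y, n_perm=19, seed=0):
    """For each feature: permute its column, refit the forest, and record
    that feature's importance under the permutation. Repeating this builds
    an empirical null distribution; the p-value is the fraction of null
    importances at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    base = RandomForestClassifier(n_estimators=30, random_state=seed).fit(X, y)
    observed = base.feature_importances_
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        null = np.empty(n_perm)
        for b in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's link to y
            rf = RandomForestClassifier(n_estimators=30,
                                        random_state=seed).fit(Xp, y)
            null[b] = rf.feature_importances_[j]
        # add-one correction keeps the estimated p-value strictly positive
        pvals[j] = (1 + np.sum(null >= observed[j])) / (n_perm + 1)
    return pvals

# Synthetic data: 3 informative features among 6
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
pvals = permutation_pvalues(X, y)
selected = np.where(pvals <= 0.05)[0]  # keep features with small p-values
```

Note that each permutation refits the whole forest, so the cost grows with both the number of features and the number of permutations; this is exactly the expense that the mutual-information filtering stage below is meant to reduce.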
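The first stage of the hybrid algorithm can be sketched as follows, again assuming scikit-learn. This is a simplified stand-in: the thesis uses a greedy search over mutual information, while here a plain top-k ranking by `mutual_info_classif` illustrates the filtering idea, and `k = 8` is an arbitrary cut-off chosen for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 4 informative features among 20
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=1)

# Stage 1: score each feature by its mutual information with the class
# variable and keep the k highest-scoring features, quickly discarding
# irrelevant ones before the costlier random-forest stage.
k = 8
mi = mutual_info_classif(X, y, random_state=1)
top_k = np.argsort(mi)[::-1][:k]
X_reduced = X[:, top_k]
```

The reduced matrix `X_reduced` would then be handed to the permutation-test stage, which selects the final feature subset from this smaller candidate pool.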
Keywords/Search Tags: Feature Selection, Random Forest, Mutual Information, Hybrid Algorithm