
Research On Mutual Information Based Feature Selection Method For High Dimensional Small Sample Data

Posted on: 2018-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhang
Full Text: PDF
GTID: 2348330521951633
Subject: Software engineering
Abstract/Summary:
With improvements in computer performance and advances in data acquisition technology, the data generated in many fields has grown dramatically. High-dimensional data may contain irrelevant and redundant features, which not only increase computational cost but also reduce classification accuracy, so high-dimensional data classification has become a very challenging problem. Selecting the most informative features from the original feature set (feature selection) is an effective way to address it. Feature selection has therefore become a research hotspot in data mining, pattern recognition, and machine learning, and has received wide attention in many applications.

The ultimate goal of feature selection is to find a feature subset that carries as much valuable information as possible while remaining as small as possible. A number of studies have proposed evaluation criteria to measure the correlation between a feature subset and the class labels, and the redundancy within the subset. Among them, measures based on mutual information (MI) have the advantage of capturing both linear and nonlinear relationships. However, some existing methods do not consider feature redundancy at all, and others ignore the locality of that redundancy; both problems can limit the achievable classification performance. This thesis focuses on feature selection for high-dimensional small-sample data. The main contributions are the following:

(1) A group-based filtering feature selection algorithm. Building on previous typical feature selection algorithms, we consider both the correlation between each single feature and the class label and the locality of feature redundancy, and propose a group-based mutual information filtering feature selection algorithm (GBFS). A comparison with five typical mutual information-based feature selection methods on UCI machine learning data sets shows that GBFS achieves good classification accuracy and short running time simultaneously.

(2) A hybrid feature selection algorithm based on GBFS and Boruta. Wrapper-type selection algorithms are slow, but the features they select classify well; Filter-type algorithms are efficient and general-purpose, yet most of them, GBFS included, cannot determine the optimal subset size. The proposed hybrid model combines Wrapper-type and Filter-type evaluation criteria to select a subset, taking advantage of both models; the characteristics of the Wrapper and Filter methods complement each other well in the hybrid model. Based on this analysis, a two-stage hybrid feature selection algorithm based on GBFS and Boruta (GBFS-Boruta) is proposed.

In summary, based on an analysis of high-dimensional small-sample data, two mutual information-based feature selection algorithms are proposed. They extend classical mutual information feature selection algorithms and improve classification accuracy while reducing the size of the selected subset. The results obtained are of some significance for the research and application of feature selection.
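The abstract does not give the details of GBFS itself, but the "relevance to the class minus redundancy with already-chosen features" idea it builds on can be sketched with a classic mRMR-style greedy filter. The code below is an illustration of that general criterion under assumed discrete features, not the thesis algorithm; the function names (`mutual_information`, `mi_filter_select`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) in nats for discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), with counts px[a]/n etc.
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

def mi_filter_select(features, labels, k):
    """Greedily pick k features, scoring each candidate by its relevance
    to the labels minus its mean redundancy with the features already
    chosen (the classic mRMR criterion, shown only to illustrate the idea)."""
    relevance = [mutual_information(f, labels) for f in features]
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for j in range(len(features)):
            if j in selected:
                continue
            redundancy = (sum(mutual_information(features[j], features[s])
                              for s in selected) / len(selected)) if selected else 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Toy data: feature 0 and feature 1 are identical (fully redundant), while
# feature 2 carries complementary information about the 4-valued label.
f0 = [0, 0, 1, 1, 0, 0, 1, 1]
f1 = list(f0)                      # exact duplicate of f0
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * a + b for a, b in zip(f0, f2)]
print(mi_filter_select([f0, f1, f2], labels, 2))  # → [0, 2]: the duplicate is skipped
```

A pure relevance ranking would pick the duplicate feature second; subtracting the redundancy term is what makes the selector prefer the complementary feature instead, which is the general problem the MI-based criteria discussed above address.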
Keywords/Search Tags: Filter Feature Selection, Mutual Information, GBFS