
Research On Feature Selection Method Based On Information Diversity Analysis

Posted on: 2020-01-09 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J Cai | Full Text: PDF
GTID: 1368330620454215 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information acquisition technology, data is now growing and accumulating at an unprecedented rate, and the information society has entered the era of "big data". These data often exhibit large sample sizes and high dimensionality, which poses great challenges to data analysis and decision making as well as to machine learning and data mining. As data dimensionality rapidly increases, datasets accumulate large amounts of redundant, irrelevant, and even noisy information. This complicates the modeling of machine learning algorithms, increases computational cost, and degrades the generalization performance and accuracy of the learned models. Feature selection, by contrast, can remove a large number of redundant, irrelevant, and noisy features during data preprocessing, and good feature selection results lead to more accurate machine learning models. Research on feature selection for high-dimensional data and its applications therefore has significant practical value.

In this dissertation, we carry out a series of studies on clustering-based feature selection, heuristic feature selection, deep feature selection, and ensemble feature selection from the perspective of information diversity analysis. Information diversity is an effective way to measure the distributional difference between variables. Specifically, the information distance measure and its transformations, together with the information cross entropy, are adopted as evaluation criteria to design new feature selection criteria and methods. The proposed methods are applied to a variety of classification models and achieve good classification accuracy. The main contents and contributions are as follows:

(1) To address the insufficient expression of feature diversity in clustering-based feature selection algorithms, we use the information distance as the diversity indicator and propose a feature selection method based on density peak clustering, called DPCID (Density Peaks Clustering based Feature Selection using Information Distance). DPCID first establishes a maximum-relevance, maximum-diversity criterion based on the information distance, and then solves the criterion with the density peak clustering algorithm. In particular, to prevent noisy features from being selected as representatives when they are grouped into one or more clusters, feature clustering is performed on the relevant feature set after eliminating noisy features, while the information diversity between features of different clusters is maximized. The method is verified on high-dimensional gene expression and text datasets; compared with classic filter and clustering-based feature selection methods on different classifiers, the experimental results show that it outperforms the other approaches. (The information distance at the core of this method is illustrated in the first sketch below.)

(2) Feature selection methods based on information relevance tend to select features with high entropy, which may lead to overfitting. We therefore introduce a self-redundancy factor as an appropriate penalty when selecting high-entropy features, and propose a heuristic feature selection method based on the information distance measure, called MFFID (Maximizing the Feature-Feature Information Distance). The selection criterion of MFFID is an expression of the information distance measure, on top of which a new forward incremental feature selection algorithm is proposed; the second sketch below illustrates such a selection loop. MFFID makes full use of the diversity between features to appropriately penalize high-entropy features. The method is compared with representative heuristic feature selection algorithms under different numbers of selected features and different classifiers, and experimental results on 12 gene expression profiles show its superiority.
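Both DPCID and MFFID rest on an information distance between features. The dissertation's exact formula is not reproduced on this page; a standard measure with the stated properties is the variation of information, D(X, Y) = H(X|Y) + H(Y|X), and the minimal sketch below (discrete features assumed, all function names our own) shows how such a distance can be computed.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a discrete variable given as a sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_distance(x, y):
    """Variation of information: D(X, Y) = H(X|Y) + H(Y|X)
    = 2 * H(X, Y) - H(X) - H(Y). It is zero exactly when the two
    variables determine each other and satisfies the triangle inequality,
    so it behaves as a genuine distance between features."""
    joint = entropy(list(zip(x, y)))  # H(X, Y) over paired symbols
    return 2.0 * joint - entropy(x) - entropy(y)

# Two discretized features over the same six samples.
x = [0, 0, 1, 1, 2, 2]
y = [0, 0, 1, 1, 1, 1]
print(information_distance(x, y))  # larger value = more diverse features
```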
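The forward incremental search in MFFID can then be pictured as a greedy loop. The published criterion lives in the dissertation itself; the sketch below substitutes a simplified stand-in (relevance to the class plus mean information distance to the features already chosen), so a high-entropy feature gains nothing unless it is also diverse. The names, the seeding rule, and the equal weighting of the two terms are all assumptions for illustration.

```python
import numpy as np

def greedy_select(relevance, dist, k):
    """Forward incremental selection in the spirit of MFFID.
    relevance: length-n array, e.g. mutual information I(f; class) per feature.
    dist:      n x n matrix of pairwise feature information distances.
    k:         number of features to select."""
    n = len(relevance)
    selected = [int(np.argmax(relevance))]        # seed with the most relevant feature
    while len(selected) < k:
        best, best_score = -1, -np.inf
        for f in range(n):
            if f in selected:
                continue
            diversity = dist[f, selected].mean()  # mean distance to chosen features
            score = relevance[f] + diversity      # stand-in for the MFFID criterion
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected

# Toy usage with a random symmetric distance matrix.
rng = np.random.default_rng(0)
rel = rng.random(8)
d = rng.random((8, 8)); d = (d + d.T) / 2; np.fill_diagonal(d, 0.0)
print(greedy_select(rel, d, 3))
```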
(3) When modeling small-sample datasets with deep learning models, over-fitting often occurs: training accuracy is high while test accuracy is low. In fact, the smaller the information cross entropy of a feature, the smaller the Bayesian classification error rate. Based on this observation, we propose a cross-entropy-based feature selection method for the high-level features of a denoising autoencoder, called HDAECE (Feature selection algorithm for High-level Denoising Automatic Encoder based on Cross-Entropy), which simplifies the autoencoder network structure and builds a classification model with strong generalization ability. The method is analyzed experimentally under varying parameters; compared with classic feature selection algorithms and deep neural networks, the selected high-level features yield classification models with better performance. (A sketch of a cross-entropy-style score for encoded features follows contribution (4) below.)

(4) Ensemble feature selection is, in essence, the combination of classifier ensembles and feature selection. However, most ensemble feature selection methods lack a good measure of the diversity between feature subsets and partition the features by a random strategy. As a result, the diversity between the selected feature subsets is not guaranteed, which inevitably leads to unstable ensemble performance. We therefore design an expression model of ensemble feature selection based on the information distance between feature subsets, and present a novel information diversity measure between feature subsets, called SMID (Sum of Minimal Information Distance). We prove theoretically that this measure is an upper bound of the information distance between feature subsets. Using SMID as a tractable substitute for the subset-level information distance, which is difficult to compute directly, we develop a new ensemble feature selection framework; concretely, the framework is combined with the mRMR, CMIM, and JMI algorithms to generate specific ensemble feature selection algorithms. Compared with other ensemble classification methods and feature selection methods, the proposed framework not only performs stably under different parameters but also achieves better effectiveness.
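For HDAECE, the exact cross-entropy criterion is defined in the dissertation; as a hedged illustration, the sketch below scores each encoded feature by the conditional entropy H(C | F) of the class given the discretized feature, which moves in the same direction as the abstract's claim (smaller entropy permits smaller Bayes error). The binning scheme and the use of conditional entropy as a stand-in for the published score are assumptions.

```python
import numpy as np

def conditional_entropy_score(feature, labels, bins=10):
    """Score one encoded feature by H(C | F) after equal-width discretization.
    feature: 1-D float array holding one high-level (encoder) feature.
    labels:  1-D integer class labels of the same length.
    Lower scores mean the feature pins the class down more tightly."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    binned = np.digitize(feature, edges[1:-1])    # bin index per sample
    n = len(labels)
    score = 0.0
    for b in np.unique(binned):
        mask = binned == b
        p_bin = mask.sum() / n                    # P(F in bin b)
        for c in np.unique(labels[mask]):
            p_c = (labels[mask] == c).mean()      # P(C = c | F in bin b)
            score -= p_bin * p_c * np.log2(p_c)
    return score

# Rank stand-in encoder outputs: keep features with the smallest H(C | F).
rng = np.random.default_rng(1)
codes = rng.normal(size=(200, 5))                 # pretend autoencoder codes
y = (codes[:, 2] > 0).astype(int)                 # class driven by feature 2
print(np.argsort([conditional_entropy_score(codes[:, j], y) for j in range(5)]))
```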
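SMID itself admits a compact reading: for each feature in one subset, take the smallest information distance to any feature in the other subset, and sum those minima. The sketch below assumes a precomputed pairwise information-distance matrix (for instance built with the first sketch above) and sums over both directions; whether the dissertation's definition is one-sided or symmetric is our assumption.

```python
import numpy as np

def smid(dist, subset_a, subset_b):
    """Sum of Minimal Information Distance between two feature subsets.
    dist: n x n pairwise information-distance matrix over all features.
    subset_a, subset_b: lists of feature indices.
    Each feature is matched to its nearest neighbour in the other subset
    and the minima are summed; a small SMID means the subsets carry
    similar information, a large SMID means they are diverse."""
    block = dist[np.ix_(subset_a, subset_b)]      # |A| x |B| cross distances
    return float(block.min(axis=1).sum() + block.min(axis=0).sum())

# Toy usage: two disjoint subsets under one symmetric distance matrix.
rng = np.random.default_rng(2)
d = rng.random((10, 10)); d = (d + d.T) / 2; np.fill_diagonal(d, 0.0)
print(smid(d, [0, 1, 2], [5, 6, 7]))  # larger value = more diverse subsets
```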
Keywords/Search Tags:Data Mining, Feature Selection, Information Distance, Cross Entropy, Ensemble Learning