Feature Selection Models And Methods Based On Information Measure For High Dimensional Data

Posted on: 2021-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y D Wang
GTID: 1488306557491604
Subject: Computer Science and Technology
Abstract/Summary:
Feature selection in high-dimensional data is an important part of the data mining process and is widely used in bioinformatics, statistics, and image processing. Successfully selecting informative features can significantly increase learning accuracy and improve the comprehensibility of results. Various feature selection methods have been proposed that identify informative features by removing redundant and irrelevant ones in order to improve classification accuracy. However, the feature dimension grows as the scale of the data grows, which readily leads to the curse of dimensionality and to overfitting. High dimensionality not only increases the time and space complexity of an algorithm but also reduces its accuracy. To address these problems, this dissertation designs feature selection models and methods that use information measures from information theory, such as mutual information, joint mutual information, and conditional mutual information, to reduce the dimension of the data while retaining the important features. The main contributions are as follows:

(1) Adaptive Structured Sparse Regression (ASSR) model: High-dimensional data typically contain many important correlation structures that are often conducive to improving prediction performance. High-dimensional data also generally contain many noisy features. Hence, it is challenging to find important correlation structures among features and to remove noisy features from high-dimensional data. We develop two strategies, based on mutual information and joint mutual information, to determine the weight of each pair of correlated features and the weight of each individual feature. We propose an ASSR model that infers the local supervised correlation structure among the features and adaptively selects the important features in groups. We also analyze several important theoretical properties of the ASSR model. The proposed model can perform feature selection for both regression and binary classification problems. Experimental results on ten classical public benchmark datasets show that the proposed model is effective in selecting informative features and achieves competitive prediction performance compared with some existing feature selection models.

(2) Multinomial Adaptive Sparse Group Lasso (MASGL) model: The feature subsets that most feature selection methods select from high-dimensional data typically still contain many redundant features, which often reduce classification performance. We propose a MASGL model to select important features in groups. To infer the local supervised correlation structure among the features in high-dimensional data, we develop a new supervised feature clustering algorithm, based on information theory, that divides features that are similar with respect to the class labels into groups. To evaluate the importance of features and groups, we propose a method for constructing both feature weights and group weights. Furthermore, we develop an algorithm to carry out the complex computation required by MASGL. Experimental results on both random data and five frequently studied public benchmark datasets show that the proposed model is effective in selecting informative features and achieves competitive classification performance compared with four existing classical feature selection models.

(3) Max-Relevance and Min-Supervised-Redundancy (MRMSR) criterion: It is challenging to select informative features from high-dimensional data, which generally contain many irrelevant and redundant features. These features often impede classifier performance. We present an efficient feature selection algorithm that improves classification accuracy by taking into account both the relevance of the features and the pairwise correlation between features with respect to the class labels. Based on conditional mutual information and entropy, we propose a new supervised similarity measure. This measure is used to evaluate feature redundancy, which is minimized, and is combined with an evaluation of feature relevance, which is maximized. The resulting MRMSR criterion for feature selection is introduced and theoretically proved. The MRMSR-based method is compared with six existing feature selection approaches on several frequently studied public benchmark datasets. Experimental results show that the proposed method is more effective at selecting informative features and yields more competitive classification performance.

(4) Weighted General Group Lasso (WGGL) model: Feature selection on high-dimensional biological data can identify genes that are highly related to the classification task and can improve classification accuracy. The ideal way to solve the binary classification problem for high-dimensional cancer gene expression data is to automatically select the groups of genes closely related to cancer and to perform classification at the same time. Most existing gene selection methods cannot fully exploit the intrinsic interaction information among the selected genes. We propose a WGGL model to select important genes in groups. A heuristic gene grouping method is presented, based on weighted gene co-expression network analysis. To determine the importance of genes and groups, we present a method for calculating gene weights and group weights in terms of joint mutual information. We also develop a corresponding solving algorithm to carry out the complex computation required by WGGL. Experimental results on both random data and three cancer gene expression datasets show that the proposed model achieves better classification performance than two existing gene selection methods.
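The information measures named in the abstract can be illustrated with a minimal sketch for discrete data. The abstract does not give the formulas of the supervised similarity measure or of the MRMSR criterion, so the greedy selector below uses the classical max-relevance, min-redundancy score (relevance minus mean pairwise mutual information) purely as a stand-in. All function names (`entropy`, `mutual_information`, `conditional_mutual_information`, `greedy_mrmr`) and the toy data are assumptions made for illustration and are not taken from the dissertation.

```python
import numpy as np


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete columns."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def greedy_mrmr(X, c, k):
    """Greedily pick k feature indices by relevance minus mean redundancy.

    Stand-in criterion (classical mRMR), NOT the dissertation's MRMSR:
    score(j) = I(f_j; c) - mean over selected s of I(f_j; f_s).
    """
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], c) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data: two independent binary features a and b, a duplicate of a,
# and a four-class label c that is determined by (a, b).
a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
c = 2 * a + b
X = np.column_stack([a, a, b])  # column 1 duplicates column 0

print(greedy_mrmr(X, c, k=2))
```

On this toy data the duplicated column is as relevant as the original but fully redundant with it, so the selector skips it and picks the independent feature instead. A supervised redundancy term, as described for MRMSR, would replace the plain pairwise mutual information in `redundancy` with a measure that also conditions on the class labels.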
Keywords/Search Tags:Feature selection, high-dimensional data, information measure, regression, binary classification, multi-class classification