
Ensemble Feature Selection For Omic Data

Posted on: 2018-07-27, Degree: Doctor, Type: Dissertation
Country: China, Candidate: J S Yan
GTID: 1310330536455916, Subject: Information and Communication Engineering
Abstract/Summary:
With the advance of high-throughput biological technologies, huge omics data sets have been produced in the fields of genomics, proteomics, and metabolomics. A basic task in bioinformatics is to extract valuable information from these data, and feature selection and classification are the most widely used techniques for doing so. However, the high dimensionality, imbalanced samples, and complex statistical distributions of omics data pose great challenges for feature selection and classification. This thesis presents several novel ensemble feature and model selection methods applicable to omics data.

1. An ensemble maximum relevance and minimum redundancy (mRMR) feature selection algorithm is proposed. In particular, the maximal information coefficient, the Pearson correlation coefficient, and mutual information are investigated and applied as measures within the mRMR framework. An improved forward feature subset search algorithm is introduced into mRMR, yielding an improved mRMR feature selection method. The different feature subsets obtained with the different measures are combined in an ensemble to make the final prediction. Classification experiments on various omics data sets demonstrate that the ensemble mRMR method can effectively improve classification accuracy.

2. An ensemble wrapper feature selection algorithm based on constrained niching binary particle swarm optimization (PSO) is proposed for omics data classification. High-dimensional omics data tend to have multiple optimal or suboptimal feature subsets, and a classification model built on a single feature subset can perform poorly because of overfitting. With niching binary PSO, multiple diverse optimal feature subsets can be found. The weak classifiers built on these optimal or suboptimal feature subsets are then combined into a strong classifier. Classification experiments on omics data show that the proposed feature selection algorithm performs better than the competing methods.

3. To deal with the impact of sample imbalance across multiple classes on feature selection and classification, the thesis proposes a novel iterative ensemble feature selection (IEFS) framework for multiclass classification of imbalanced omics data. Three filter feature selection algorithms and balanced sampling algorithms are used in the framework. Filter feature selection and sample balancing are performed iteratively and alternately, so that the feature subset is selected under a balanced sample distribution. The experimental results show that the proposed framework obtains superior or comparable classification performance relative to other feature selection algorithms without sample-balancing preprocessing.

4. A novel PSO-based feature and model selection method is proposed for omics data classification to overcome the limitations of IEFS. The particles encode candidate combinations of sample balancing, feature selection, and classification models, together with their parameter settings. Through continued iteration, particle swarm optimization adaptively searches for the combination of models and parameter settings with the best classification performance. The experimental results show that the proposed model selection method can adaptively find such a combination while avoiding the subjective bias introduced by manual settings.

In summary, this thesis proposes several ensemble feature and model selection algorithms adapted to the statistical characteristics of the features and samples of omics data. The ideas may provide insights to researchers facing the same problems.
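As a rough illustration of the mRMR idea named in contribution 1, the following is a minimal sketch and not the thesis's algorithm. It assumes the common "difference" form of the criterion (relevance minus mean redundancy), uses a plain greedy forward search rather than the improved search the abstract mentions, and uses only the absolute Pearson correlation as the measure. The function names `pearson` and `mrmr_select` and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def mrmr_select(features, labels, k):
    """Greedy forward mRMR over a list of feature columns; returns k column indices.

    Assumed criterion: |corr(feature, labels)| minus the mean |corr| between the
    candidate and the features already selected.
    """
    relevance = [abs(pearson(col, labels)) for col in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]  # first pick: relevance only
            redundancy = sum(abs(pearson(features[j], features[s]))
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (made up): f1 is a near-duplicate of f0, while f2 is less relevant
# to the labels but uncorrelated with f0.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [-3, -3, -1, -1, 1, 1, 3, 3]
f1 = [-3, -3, -1, -1, 1, 1, 3, 4]
f2 = [1, 1, -3, -3, 3, 3, -1, -1]
picked = mrmr_select([f0, f1, f2], labels, 2)
print(picked)  # [0, 2]
```

Ranking by relevance alone would keep the two near-duplicate columns (indices 0 and 1); the redundancy penalty is what swaps the second pick to index 2. An ensemble in the spirit of the abstract would repeat the selection with other measures and combine the resulting classifiers, which this sketch does not attempt.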
Keywords/Search Tags: Omics Data, Classification, Particle Swarm Optimization Algorithm, Feature Selection, Model Selection