Study Of Efficient Feature Selection And Classification Methods For Gene Expression Microarray Datasets

Posted on: 2019-07-02; Degree: Master; Type: Thesis
Country: China; Candidate: Z F Li
GTID: 2428330566493538; Subject: Computer Science and Technology
Abstract/Summary:
Since the advent of microarray technology, a large volume of gene expression microarray data has been produced, and it contains valuable biological information. Analyzing these data and mining that hidden information opens new possibilities for the diagnosis and treatment of complex diseases. Microarray datasets are characterized by small sample sizes, high dimensionality, and class imbalance, which are also the biggest challenges for existing data mining techniques. Building on existing methods, this thesis focuses on more efficient feature selection algorithms, while also addressing class imbalance and searching for a classification algorithm better suited to microarray data. Six of the most frequently used datasets in this field serve as the experimental datasets, and classification accuracy, the Matthews correlation coefficient, and the area under the ROC curve serve as the evaluation measures. Stratified 5-fold cross-validation is applied to validate the proposed approaches. The main work and conclusions are as follows:

(1) A data resampling method called RVOS is proposed to address the class imbalance of gene expression microarray datasets. The experimental results show that the balanced dataset yields comparable or better classification results, and that these results are more credible because the class distribution is balanced.

(2) The recursive feature elimination strategy is improved, yielding a recursive feature elimination method with variable step size called VSSRFE. SVM-VSSRFE and SVM-RFE are each used as feature selectors. The experimental results show that SVM-VSSRFE reduces time consumption by a factor of hundreds and achieves better classification performance on three datasets, while classification accuracy decreases to some extent on the other three.

(3) A large-scale linear support vector machine called LLSVM is introduced to perform feature selection more efficiently. It is a more efficient implementation of the common support vector machine (SVM), designed for high-dimensional linear classification problems such as microarray data. Experimental results show that, while maintaining the quality of feature selection, LLSVM takes much less time than the classical support vector machine on five datasets, with reductions of more than 10 times on some of them.

(4) The influence of different classifiers on the classification results is studied in depth. The experimental results on six datasets show that support vector machines are not always the best choice, and that L2-regularized logistic regression can achieve comparable or better results.
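The variable-step idea behind VSSRFE can be illustrated with a minimal sketch. The abstract does not state the actual step schedule, so the rule used here (drop a fixed fraction of the remaining features while many remain, then one feature at a time) is an assumption for illustration only, as are the names `variable_step_rfe`, `score_features`, `fraction`, and `switch_at`. In SVM-RFE the scoring callback would retrain a linear SVM on the active features and return the squared weights; the toy scorer below merely stands in for that step.

```python
from typing import Callable, List, Sequence


def variable_step_rfe(
    n_features: int,
    score_features: Callable[[List[int]], Sequence[float]],
    n_keep: int,
    fraction: float = 0.5,   # assumed: share of features dropped per early round
    switch_at: int = 100,    # assumed: below this count, drop one feature per round
) -> List[int]:
    """Recursive feature elimination with a shrinking step size (illustrative)."""
    if not 0 < n_keep <= n_features:
        raise ValueError("n_keep must be in 1..n_features")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")

    active = list(range(n_features))
    while len(active) > n_keep:
        # One importance score per active feature (e.g. squared SVM weights).
        scores = score_features(active)
        if len(active) > switch_at:
            step = max(1, int(len(active) * fraction))
        else:
            step = 1
        # Never remove more features than needed to reach n_keep.
        step = min(step, len(active) - n_keep)
        # Drop the `step` lowest-scoring positions.
        weakest = sorted(range(len(active)), key=lambda i: scores[i])[:step]
        drop = set(weakest)
        active = [f for i, f in enumerate(active) if i not in drop]
    return active


# Toy stand-in for the SVM: a fixed, tie-free importance per feature.
importance = [(f * 37) % 1000 for f in range(1000)]
call_log: List[int] = []  # number of active features seen in each scoring round


def toy_scores(active: List[int]) -> List[float]:
    call_log.append(len(active))
    return [float(importance[f]) for f in active]


selected = variable_step_rfe(1000, toy_scores, n_keep=5)
print(sorted(selected), "scoring rounds:", len(call_log))
```

With a fixed step size of 1, reducing 1000 features to 5 would take 995 scoring rounds, each of which corresponds to one SVM retraining. A shrinking step needs far fewer rounds, at the cost of coarser early elimination decisions, which is one plausible reading of the speed and accuracy trade-off reported in item (2).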
Keywords/Search Tags: Microarray datasets, Class imbalance, Feature selection, RFE-SVM, Classification