Research On Methods Of Supervised Learning-based Cancer Feature Gene Selection

Posted on: 2017-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: T Li
Full Text: PDF
GTID: 2334330488967338
Subject: Computer software and theory
Abstract/Summary:
Machine learning builds probabilistic and statistical models from data, infers the knowledge embedded in that data, and uses the resulting abstract model to analyze and predict new data. The quality of the data directly affects the performance of machine learning. In medicine, the acquisition of gene expression profile data involves large experimental error, and as the dimensionality of the data grows rapidly, the data come to contain many irrelevant and redundant genes. This not only reduces the performance of machine learning but also poses a great challenge to tumor diagnosis and prediction. For gene expression profile data with high dimensionality and few samples, it is therefore necessary to explore more robust and more interpretable algorithmic models. Identifying the disease-discriminating feature genes within massive data has great research significance and application value for tumor diagnosis. Although many feature gene selection algorithms exist, they still suffer from poor generalization ability and low efficiency. To address these problems, this thesis studies gene expression profile data from the perspective of supervised learning, selecting feature subsets with high relevance and low redundancy in order to improve classification accuracy and running efficiency. The main contributions of the thesis are as follows:

(1) Traditional feature gene selection methods select a large number of irrelevant genes, which lowers the accuracy of sample prediction. To solve this problem, a new feature gene selection method based on logistic regression and correlation entropy is proposed. First, the two conditional probability values given by the binomial logistic regression model are compared to obtain the genes that have a great influence on classification, which effectively reduces the time and space cost of subsequent computation. Second, the Relief algorithm is used to compute the importance of the feature genes and rank them, and the irrelevant genes are deleted to generate candidate feature gene subsets. Third, the correlation between genes is measured by constructing a correlation coefficient matrix, and redundant genes are eliminated, which to a certain extent avoids overfitting of the model to the sample data. Finally, a support vector machine is used as the classifier on the resulting feature gene subset. Cross-validation results on UCI data sets show that the proposed method obtains a smaller gene subset and achieves higher classification accuracy.

(2) Traditional gene selection methods select a large number of redundant genes, which lowers sample classification accuracy. To address this, a feature gene selection method based on the signal-to-noise ratio and the neighborhood rough set (SNRS) is put forward. First, from the perspective of measuring feature weights, a revised signal-to-noise ratio is used to obtain a primary feature subset; the genes are divided into sections according to their signal-to-noise ratio, and the genes with larger signal-to-noise ratio are selected as the candidate feature subset. On this basis, following the idea of attribute reduction, a neighborhood rough set reduction algorithm is used to eliminate redundant genes from the candidate gene subset, yielding the optimized feature gene subset. Finally, the feature gene subset is classified by three different classifiers. Experiments show that the proposed method obtains a smaller feature gene subset and also improves sample classification accuracy.

(3) Existing feature selection methods do not fully consider the correlation between features, which lowers classification accuracy. To address this, a tumor gene selection algorithm based on statistical characteristics and the neighborhood rough set is proposed. The algorithm starts from the feature selection model for gene expression profile data. First, the methods for measuring feature genes are analyzed, and a new evaluation criterion for feature gene importance is established by introducing relative information entropy. Second, the newly constructed model for computing the correlation between feature genes is introduced into the FRE-SVM algorithm, taking into account the joint contribution of feature genes to sample classification. Then, the candidate feature gene subset is optimized by neighborhood rough sets with different neighborhood radii. Finally, different classifiers are used to classify the optimized feature subsets. Experiments show that this method overcomes the low classification accuracy of traditional methods and obtains higher classification accuracy with only a few feature genes.
Keywords/Search Tags: Supervised learning, Feature gene selection, Neighborhood rough set, Logistic regression model, Signal-to-noise ratio
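
The "rank by relevance, then remove redundancy" idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: it assumes the classic two-class signal-to-noise ratio |mean0 - mean1| / (std0 + std1) rather than the revised version, and it substitutes a plain Pearson-correlation filter for the neighborhood rough set reduction. The function names, the `top_k` and `max_corr` parameters, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def snr_scores(X, y):
    """Classic two-class signal-to-noise ratio per gene (labels 0 and 1)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        spread = pstdev(a) + pstdev(b)
        # Genes that are constant within both classes get a score of 0.
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def select_genes(X, y, top_k=3, max_corr=0.9):
    """Keep the top_k genes by SNR, then greedily drop any gene whose
    absolute correlation with an already kept gene reaches max_corr."""
    scores = snr_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    kept = []
    for g in ranked[:top_k]:
        col_g = [row[g] for row in X]
        if all(abs(pearson(col_g, [row[k] for row in X])) < max_corr
               for k in kept):
            kept.append(g)
    return kept


# Toy data (made up): 6 samples x 4 genes, two classes.
# Gene 0 separates the classes, gene 1 is a noisier near-copy of gene 0,
# gene 2 carries no class signal, gene 3 is weakly informative.
X = [
    [1.0, 1.0, 2.0, 5.0],
    [1.2, 1.4, 1.0, 3.0],
    [0.8, 0.6, 3.0, 4.0],
    [3.0, 3.0, 2.0, 3.0],
    [3.2, 3.4, 3.0, 1.0],
    [2.8, 2.6, 1.0, 2.0],
]
y = [0, 0, 0, 1, 1, 1]

print(select_genes(X, y))  # [0, 3]
```

On this toy data, gene 1 ranks second by signal-to-noise ratio but is dropped because it is almost perfectly correlated with gene 0, so the selected subset is small and non-redundant. In the thesis the selected subset would then be passed to a classifier such as a support vector machine.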