
A Study Of Tumor Classification Algorithms Using Gene Expression Data

Posted on: 2013-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H J Lu
Full Text: PDF
GTID: 1228330392454403
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of gene chip technology, more and more tumor gene expression data can be measured. Early diagnosis of tumors at the molecular level, based on gene expression data, is very important: an accurate early diagnosis greatly benefits treatment, whereas a misdiagnosis may cause cancer patients to miss the best treatment opportunity. Gene expression data typically have high dimensionality, an imbalanced class distribution, and a small sample size, so how to analyze, process, and use these data effectively has drawn increasing attention from researchers in this area. Because of the large number of redundant genes and the noise in the data, the generalization performance of classifiers trained on gene expression data has not yet reached the level required for practical application. To solve the classification problems of tumor gene expression data, current research focuses on two aspects: (i) identifying the few critical causative genes in high-dimensional data; and (ii) developing the most suitable algorithms and improving their performance. This dissertation studies a novel machine learning algorithm, the extreme learning machine (ELM), to build classification models and predict gene expression information. Several tumor and non-tumor data sets obtained from experiments are used to validate the developed algorithms. The achievements of this dissertation are briefly described as follows:

(1) A gene selection method based on a genetic algorithm and information gain is proposed to sharply reduce the dimensionality of the data.
Genetic evolution is used to transform the gene selection problem into a global optimization problem. In the genetic algorithm search stage, the fitness function is the ratio of the between-class distance to the within-class distance, which makes the algorithm a model-independent gene selection method for reducing the data dimension. Experimental results show that the selected features are closely related to the classification objective and that the method improves the generalization performance of the classifier.

(2) To address the imbalanced and small-sample nature of gene expression data, the idea of expanding the small class and reducing the large class is explored, and the FS-Sampling algorithm is put forward. Crucial features are selected by analyzing the characteristics of the gene expression data, and small-class samples are synthesized following SMOTE sampling theory. The experiments show that the presented methods balance the data distribution well and effectively improve the classification accuracy on the tumor data.

(3) To study the impact of data distribution on the approximation accuracy of neural network models, and to address the instability of a single ELM, an ensemble algorithm based on dataset splitting is presented, following an ensemble strategy built on differences between datasets. First, the original training dataset is divided into k disjoint subsets. Second, random re-sampling is performed on k-1 of the k subsets to obtain a training dataset, on which a neural network classifier is trained. This training procedure is repeated n times to obtain n neural networks. Finally, the class label of an unknown sample is predicted by the ensemble classifier through majority voting.
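The four steps of item (3) can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the dissertation's implementation: `TinyELM`, `split_ensemble_fit`, and `split_ensemble_predict` are hypothetical names, the base learner is a textbook single-hidden-layer ELM with a tanh activation, and "re-sampling on k-1 of the k subsets" is read here as leaving one subset out at random and drawing a bootstrap sample from the rest.

```python
import numpy as np


class TinyELM:
    """Minimal ELM classifier used as a stand-in base learner (assumption)."""

    def __init__(self, n_hidden=20, rng=None):
        self.n_hidden = n_hidden
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        # Random, untrained hidden layer with a tanh activation.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.eye(n_classes)[y]  # one-hot class targets
        # Output weights are the least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


def split_ensemble_fit(X, y, k=5, n_models=15, seed=0):
    """Train n_models classifiers, each on a resample of k-1 of k subsets."""
    rng = np.random.default_rng(seed)
    n_classes = int(y.max()) + 1
    # Step 1: divide the training indices into k disjoint subsets.
    folds = np.array_split(rng.permutation(len(X)), k)
    models = []
    for _ in range(n_models):
        # Step 2 (assumed reading): drop one subset, bootstrap from the rest.
        held_out = rng.integers(k)
        pool = np.concatenate([f for i, f in enumerate(folds) if i != held_out])
        idx = rng.choice(pool, size=len(pool), replace=True)
        # Step 3: train one network per resampled training set.
        models.append(TinyELM(rng=rng).fit(X[idx], y[idx], n_classes))
    return models, n_classes


def split_ensemble_predict(models, n_classes, X):
    """Step 4: combine the member predictions by majority vote."""
    counts = np.zeros((n_classes, len(X)), dtype=int)
    columns = np.arange(len(X))
    for model in models:
        counts[model.predict(X), columns] += 1
    return np.argmax(counts, axis=0)
```

Because each member sees a different resample and draws its own random hidden layer, the members disagree on some samples, which is the diversity the voting step relies on. Ties in the vote are resolved here in favor of the lower class index, which is a simplification.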
Experimental results show that the algorithm increases the diversity among the neural networks and effectively improves the accuracy of the classifier ensemble.

(4) To cope with the unbalanced performance of a single ELM, an ensemble classifier is built at the level of the output results. Starting from the differences among the outputs of the member classifiers, an ensemble built from selected classifiers with large dissimilarity, named D-D-ELM, is presented. First, the diversity of the ELM models is judged according to different measures on their outputs. Then, models whose classification accuracy is below the average are removed. Finally, the selected classification models are combined by voting. Both theoretical analysis and experimental results demonstrate that the algorithm effectively improves the accuracy of the classifier ensemble by exploiting the large diversity among the neural networks.

(5) To reduce decision risk and average cost, with the minimum classification cost as the target, classification with an embedded rejection cost and an asymmetric misclassification cost is studied. An ELM algorithm with embedded misclassification cost and rejection cost is proposed. Embedding cost-sensitive factors in the algorithm allows it to handle data with different costs directly. The experiments show that the method reduces the total classification cost and effectively improves the classification accuracy on the tumor data.

To sum up, developing algorithms that classify tumor gene expression data efficiently is a challenging task, since many existing algorithms suffer from the problems of small sample size with high dimensionality, data dimension reduction, and imbalanced distribution.
The work in this thesis develops effective gene selection and over-sampling synthesis methods, which not only improve the performance of the classifier but also exclude a large number of unrelated genes. This work could greatly benefit further study and application in locating disease genes and diagnosing the related diseases.

For data classification, an ensemble classification model based on neural networks and the ELM is presented. The ensemble algorithm considers differences in both the datasets and the classifier outputs, and cost-sensitive factors are embedded in the algorithm to reflect the importance of different data during tumor recognition. This work develops a suitable algorithmic framework for classifying gene expression data and improves the classification accuracy on tumor gene expression data. The research has theoretical significance for the classification of high-dimensional, imbalanced data and offers some useful practical guidance for tumor diagnosis.
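The fitness criterion named in item (1), the ratio of between-class distance to within-class distance, can be illustrated with a short Python sketch. The abstract does not say how the two distances are computed, so the scatter-based definitions below are an assumption, and `distance_ratio_fitness` is a hypothetical name. A genetic algorithm would call such a function on each candidate gene subset, encoded here as a boolean mask over the gene columns.

```python
import numpy as np


def distance_ratio_fitness(X, y, mask):
    """Score a gene subset by between-class / within-class scatter.

    X is (n_samples, n_genes), y holds class labels, and mask selects genes.
    Both scatter definitions are assumptions, not taken from the dissertation.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0  # an empty gene subset carries no information
    Xs = X[:, mask]
    overall_mean = Xs.mean(axis=0)
    between = 0.0
    within = 0.0
    for label in np.unique(y):
        Xc = Xs[y == label]
        class_mean = Xc.mean(axis=0)
        # Between-class term: class mean vs. overall mean, weighted by size.
        between += len(Xc) * np.sum((class_mean - overall_mean) ** 2)
        # Within-class term: spread of the samples around their class mean.
        within += np.sum((Xc - class_mean) ** 2)
    # The small constant guards against division by zero.
    return between / (within + 1e-12)
```

The score depends only on the data and the mask, not on any classifier, which matches the abstract's description of the method as model independent.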
Keywords/Search Tags: Tumor Gene Expression Data, Feature Selection, Ensemble Learning, Cost Sensitive, Extreme Learning Machine