The Application Of Multiple Classifier Systems In The Analysis Of Gene Microarray Datasets

Posted on: 2009-10-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K H Liu
GTID: 1118360242495863
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Multiple classifier systems (MCS) have drawn much attention in the field of machine learning. By fusing a set of base classifiers, the final ensemble system generalizes better than a single strong classifier. Ensemble systems are therefore a promising solution for many problems that are hard for traditional pattern classification methods.

DNA microarray technology is a recently developed technology that draws on physics, electronics, molecular biology, and other fields. It has been widely applied in biological and medical research. Among its applications, microarray-based cancer diagnosis makes it possible to study the pathological mechanisms of cancer in depth, including its onset and spread. To achieve reliable diagnosis and prediction of cancer types, much research focuses on identifying the key genes associated with different cancers and on classifying cancers. However, because microarray datasets combine small sample sizes with high dimensionality, traditional methods cannot always perform well.

This thesis focuses on the analysis and classification of microarray datasets based on multiple classifier systems. Its main work can be summarized as follows:

(1) The selection of key genes in a microarray dataset is usually treated as a feature selection problem. In this study, the merits of filter and wrapper methods are combined to design two ensemble feature selection systems, based on a standard genetic algorithm (GA) and a multi-objective GA, respectively. In these methods, filter methods first pick a set of genes, and the GAs then select suitable subsets from which base classifiers are constructed. The experimental results show that these methods are capable of selecting optimal feature subsets and that the ensemble systems built in this way are robust.

(2) Independent Component Analysis (ICA) is a recently proposed linear transformation method that has been applied successfully to the analysis of microarray datasets. Inspired by ensemble feature selection, an ensemble independent component selection method is proposed. In this method, a microarray dataset is first transformed by the ICA algorithm to obtain a set of independent components (ICs), and a standard GA then picks a collection of IC subsets from this set to construct different base classifiers. Because this method ensures diversity among the base classifiers, the ensemble system is robust even when the base classifiers are combined with a simple majority vote rule.

(3) When ICA algorithms are applied to microarray datasets, the results are not always reproducible: different ICA transformations yield different IC sets. This thesis therefore proposes a multi-objective GA that selects optimal IC subsets from different IC sets. These IC subsets are used to train the base classifiers from which the ensemble system is built. With this method, the diversity among base classifiers is much higher than with the method in (2), so the ensemble system generalizes well.

(4) Rotation Forest is a recently proposed ensemble system. Its success lies in using a linear transformation method to build a rotation matrix, which then projects the data onto different axes. In this way, diverse base classifiers are obtained. Because this ensemble system is computationally expensive on high-dimensional datasets, it had not previously been applied to microarray datasets. In this thesis, filter methods are used to reduce the dimensionality of the datasets so that Rotation Forest can be used to analyze them, and ICA is employed to construct the rotation matrix for the first time. The experimental results show that Rotation Forest achieves better performance than other ensemble schemes, and that ICA-based Rotation Forest achieves the highest classification accuracy.

(5) Classifying multiclass microarray datasets is much more difficult than classifying two-class datasets, because each class usually has fewer samples and the distribution of samples across classes is unbalanced. To classify multiclass microarray datasets efficiently, a genetic programming (GP) method is proposed, based on the idea of splitting a multiclass problem into multiple two-class problems. The distinguishing feature of this GP is that each individual consists of a set of small-scale ensemble systems (called sub-ensembles here), each of which tackles one of the two-class problems. In this way, each individual can solve a multiclass problem directly, and the GP can perform feature selection and classification at the same time. A diversity measure is proposed based on the difference among the features used in each tree, and a greedy local improvement algorithm is used to maintain diversity among the sub-ensembles. These measures ensure the high efficiency of the GP.
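The filter-then-ensemble idea in item (1), combined with the majority vote rule mentioned in item (2), can be illustrated with a minimal sketch. It is not the thesis's method: random feature subsets stand in for the GA search, a nearest-centroid rule stands in for the unspecified base classifier, the filter score (gap between class means) is an assumption, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def filter_rank(X, y, k):
    """Filter step (assumed score): keep the k features with the largest
    absolute gap between the two class means."""
    a, b = sorted(set(y))  # this toy filter handles exactly two classes
    scores = []
    for j in range(len(X[0])):
        mean_a = mean(row[j] for row, label in zip(X, y) if label == a)
        mean_b = mean(row[j] for row, label in zip(X, y) if label == b)
        scores.append((abs(mean_a - mean_b), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def train_centroid(X, y, features):
    """Base classifier: nearest class centroid restricted to a feature subset."""
    centroids = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]

    def predict(row):
        def dist(label):
            return sum((row[j] - c) ** 2
                       for j, c in zip(features, centroids[label]))
        return min(sorted(centroids), key=dist)

    return predict


def build_ensemble(X, y, n_filtered=4, n_members=5, subset_size=2, seed=0):
    """Train one base classifier per feature subset drawn from the filtered pool."""
    rng = random.Random(seed)
    pool = filter_rank(X, y, n_filtered)
    # Random subsets are a stand-in for the GA-selected subsets in the abstract.
    return [train_centroid(X, y, rng.sample(pool, subset_size))
            for _ in range(n_members)]


def majority_vote(members, row):
    """Combine base classifier outputs with a simple majority vote."""
    votes = Counter(member(row) for member in members)
    return votes.most_common(1)[0][0]


# Toy data: features 0-3 separate the classes, features 4-5 do not.
X = [
    [5.0, 4.8, 5.2, 4.9, 0.1, 0.3],
    [5.1, 5.0, 4.9, 5.2, 0.2, 0.1],
    [4.9, 5.1, 5.0, 5.0, 0.3, 0.2],
    [1.0, 1.2, 0.9, 1.1, 0.2, 0.3],
    [1.1, 0.9, 1.0, 1.0, 0.1, 0.2],
    [0.9, 1.0, 1.1, 0.8, 0.3, 0.1],
]
y = ["tumor"] * 3 + ["normal"] * 3

members = build_ensemble(X, y)
prediction = majority_vote(members, [5.0, 5.0, 5.0, 5.0, 0.2, 0.2])
```

An odd number of base classifiers is used so that a two-class vote cannot tie. Replacing the random subsets with a search that scores each candidate subset by validation accuracy would bring the sketch closer to the wrapper stage described above.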
Keywords/Search Tags: Multiple Classifier System, Microarray Datasets, Genetic Algorithm, Ensemble Feature Selection, Diversity, Independent Component Analysis, Genetic Programming, Base Classifier, Rotation Forest