Dimension Reduction And Programming Algorithms For Biological Classification Problem

Posted on: 2016-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: M L Zhou
GTID: 2308330479976786
Subject: Computer applications
Abstract/Summary:
Gene expression profiles generated by high-throughput biological OMICs technologies have been computationally investigated for over thirty years, but challenging problems such as disease biomarker detection remain. Cancer is one of the major causes of human death, and current clinical trials show that most of the known effective treatments require sensitive and specific diagnosis of cancer subtypes. Experimental data support the view that cancer patients can be precisely separated from healthy controls by a subset of genes extracted from high-dimensional gene expression profiles (a binary classification problem). The microarray is a stable and mature biotechnology that simultaneously detects the expression levels of all the genes in a given biological sample. The major handicap is the difficulty of collecting patient samples, owing to the limited number of patients accessible to an investigator and the high cost of modern data generation technologies. It is also biologically known that only a few genes are involved in the onset and development of a given disease; the remaining genes have no association with the disease, so they introduce noise into the disease detection model and significantly decrease its classification performance. Together, the small number of samples and the large number of genes give rise to the dominant paradigm in biological big data, the "large p small n" paradigm. From the computational perspective, training a classification model on a large number of features with only a small number of samples may cause overfitting. An overfitted model depends heavily on its training dataset and usually performs much worse on datasets other than the training one.
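The overfitting risk described above can be illustrated with a toy experiment. The sketch below is not taken from the thesis: it assumes purely random features and random class labels, and it uses a 1-nearest-neighbour classifier as a stand-in for any model flexible enough to memorize its training data. The sizes `n_train`, `n_test`, and `p` are arbitrary illustrative choices.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train_x, train_y, sample):
    """Return the label of the training sample closest to `sample`."""
    nearest = min(range(len(train_x)),
                  key=lambda i: squared_distance(train_x[i], sample))
    return train_y[nearest]

def accuracy(train_x, train_y, xs, ys):
    """Fraction of samples in (xs, ys) that the 1-NN rule labels correctly."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(ys)

rng = random.Random(0)           # fixed seed so the run is repeatable
n_train, n_test, p = 20, 200, 500  # "large p small n": 500 features, 20 samples

def make_noise_data(n):
    """Features and labels are independent noise: no feature carries signal."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys

train_x, train_y = make_noise_data(n_train)
test_x, test_y = make_noise_data(n_test)

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Each training sample is its own nearest neighbour, so the training accuracy is perfect even though the labels are noise, while the accuracy on fresh samples stays close to chance. This gap is the dependency on the training dataset that the paragraph describes.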
A "feature selection" procedure is therefore required to extract a small number of features that are significantly associated with the phenotypes or class labels. Quite a few association-based feature selection algorithms have been published, but their classification performance still leaves room for improvement, owing to the assumptions these algorithms make about the data-fitting functions. This thesis introduces the Maximal Information Coefficient (MIC) into the feature selection procedure CFS, and obtains a satisfactory solution to the resulting difficult high-dimensional 0-1 programming problem by using a heuristic feature screening strategy. The thesis further investigates how biological knowledge can be fused into the feature selection procedure, and adopts a constrained linear programming model for this purpose. As far as we know, this is the first feature selection algorithm with user-defined constraints. With the rapid development of molecular biology and of molecular diagnosis and treatment technologies, many genes have been proven to be associated with particular diseases. However, existing feature selection algorithms only optimize a statistical function and do not take any known disease biomarkers into account. This thesis introduces known disease biomarkers as constraints in the linear programming model, and optimizes the binary classification error rate while keeping the user-defined biomarkers fixed in the final feature subset. The solution space of an optimization problem usually contains multiple sub-optimal solutions, and our hypothesis is that a solution containing the fixed user-defined biomarkers may exhibit both higher classification performance and greater biological relevance. Experimental results on dozens of datasets support this hypothesis. In future work, the number of selected features will be treated as the next optimization goal, building on the current feature selection model.
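The idea of fixing user-defined biomarkers in the selected subset can be sketched as follows. This is an illustrative assumption, not the thesis's algorithm: it uses the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), with the absolute Pearson correlation as a stand-in for MIC, and a greedy forward search in place of the constrained linear programming model. The function names, the toy data, and the index in `known_biomarkers` are all hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def association(x, y):
    """Stand-in association score; the thesis uses MIC here instead."""
    return abs(pearson(x, y))

def cfs_merit(subset, columns, labels):
    """CFS merit: high feature-class association, low feature-feature redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(association(columns[j], labels) for j in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(association(columns[a], columns[b]) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_select(columns, labels, fixed=(), max_features=5):
    """Forward selection that always keeps the `fixed` feature indices."""
    selected = list(fixed)
    best = cfs_merit(selected, columns, labels)
    while len(selected) < max_features:
        scored = [(cfs_merit(selected + [j], columns, labels), j)
                  for j in range(len(columns)) if j not in selected]
        if not scored:
            break
        merit, j = max(scored)
        if merit <= best:        # stop when no candidate improves the merit
            break
        selected.append(j)
        best = merit
    return selected, best

# Toy data (hypothetical): 30 samples, 8 features.
rng = random.Random(1)
labels = [i % 2 for i in range(30)]
columns = [[y + rng.gauss(0, 0.3) for y in labels]]             # feature 0: informative
columns.append([v + rng.gauss(0, 0.1) for v in columns[0]])     # feature 1: redundant copy
columns += [[rng.gauss(0, 1) for _ in labels] for _ in range(6)]  # features 2-7: noise

known_biomarkers = [7]  # hypothetical user-supplied constraint
selected, merit = greedy_select(columns, labels, fixed=known_biomarkers)
print("selected features:", selected, "merit:", round(merit, 3))
```

Because the search starts from the fixed indices and only ever adds features, the user-defined biomarkers are guaranteed to appear in the result, which mirrors the constraint described above without reproducing the thesis's optimization model.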
Keywords/Search Tags: Bioinformatics, gene expression profile, feature selection, maximal information coefficient (MIC), heuristic rule, constrained linear programming