Application Study Of Gene Expression Data On Diagnosis Of Tumor And Prediction Of Gene Function

Posted on: 2010-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D S Huang
GTID: 1114360275967467
Subject: Oncology
Abstract/Summary:
Introduction
The postgenomic era has produced a multitude of high-throughput methodologies that generate massive volumes of gene expression data. Microarrays can measure the expression levels of thousands of genes simultaneously and have greatly facilitated the discovery of new biological knowledge. Microarray experiments may lead to a more complete understanding of the molecular variations among tumors and hence to a more accurate and informative classification. However, this kind of knowledge is often difficult to grasp, and turning raw microarray data into biological understanding is by no means a simple task. Even a simple, small-scale microarray experiment generates thousands to millions of data points. One feature of microarray data is that the number of tumor samples collected tends to be much smaller than the number of genes: the former is on the order of tens or hundreds, while each chip typically contains thousands of genes. In statistical terms, this is the "large p, small n" problem, i.e. the number of predictor variables p is much larger than the number of samples n. Microarrays therefore present a new challenge for statistical methods. Traditional statistical methodologies for classification or prediction do not work well when the number of variables p (genes) far exceeds the number of samples n, so the analysis of gene expression microarray data requires either an appropriate choice among existing statistical methodologies or the development of new ones.
A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. Gene expression microarrays have provided a high-throughput platform for discovering genomic biomarkers for cancer diagnosis and prognosis. Many gene expression signatures have been identified in recent years for accurate classification of tumor subtypes or for prognosis of patient survival outcome.
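One standard way to see why the "large p, small n" setting defeats conventional methods such as LDA is that a sample covariance matrix estimated from n samples has rank at most n - 1, so it cannot be inverted once the number of genes p exceeds that. The abstract does not give this reason itself; the sketch below only illustrates the argument on simulated integer data, and the helper names (`matrix_rank`, `sample_covariance`) are hypothetical, not taken from the dissertation.

```python
from fractions import Fraction
import random


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def sample_covariance(data):
    """p x p sample covariance of an n x p data matrix (rows are samples)."""
    n, p = len(data), len(data[0])
    means = [Fraction(sum(row[j] for row in data), n) for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(p)]
            for j in range(p)]


# Simulated toy data (an assumption): 5 samples, 12 "genes".
random.seed(0)
n_samples, n_genes = 5, 12
data = [[random.randint(0, 100) for _ in range(n_genes)]
        for _ in range(n_samples)]

cov = sample_covariance(data)
rank = matrix_rank(cov)
print(f"rank {rank} of a {n_genes} x {n_genes} covariance matrix")
```

Because the rank can never exceed n - 1, the covariance matrix here is singular. That is the usual motivation for the regularized and shrinkage variants of discriminant analysis that the study compares.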
Current methods for classifying human malignancies mostly rely on a variety of feature selection methods and classifiers for selecting informative genes. Many previous studies focused on one method or a single dataset. Cancer is not a single disease; there are many different kinds of cancer, arising in different organs and tissues through the accumulated mutation of multiple genes. From a statistical point of view, evaluation of the most commonly employed methods may give more accurate results if it is based on a collection of multiple datasets. Rational use of the available biological information can not only effectively remove or suppress noise in gene chips, but also avoid the one-sided results of a single experiment. However, only a few studies have recognized the importance of prior information in tumor classification. Together with the application of discriminant techniques, we propose a method that incorporates prior knowledge into tumor classification based on gene expression data. The main problem is how to incorporate prior biological knowledge and where to obtain it. For the purposes of this study, prior knowledge is any information about lung adenocarcinoma related genes that has been confirmed in the literature. Prior knowledge is viewed here as a means of directing the classifier using known lung adenocarcinoma genes.
Cluster analysis was first applied to microarray data in the late 1990s and has become an increasingly important tool for gene expression analysis. Because co-expressed genes are likely to share the same biological function, cluster analysis of gene expression profiles has been applied to gene function discovery. Cluster diagrams display the "terminal branches" as a list of genes that share similar behavior across multiple experiments. Most clustering analyses work by first defining a distance metric based on a biological network, either a pathway or GO. Such a distance is incomplete and inaccurate if it ignores the available prior knowledge. In general, given the relatively high noise levels of genomic data, it is recognized that incorporating biological knowledge into statistical analysis is a reliable way to maximize statistical efficiency and enhance the interpretability of analysis results.

Objectives
In the present study, we evaluated the most commonly employed discriminant methods and explored their features and application in order to improve the accuracy of tumor/tissue classification. We then incorporated prior biological knowledge into tumor classification, again to improve the accuracy of tumor/tissue classification. We also present a new methodology that combines prior knowledge about gene functions with cluster analysis for analyzing gene expression data, in order to improve the accuracy and interpretive validity of cluster results.

Methods
The performance of several popular discrimination methods for gene expression data was studied with five publicly available cancer microarray datasets. The nearest shrunken centroid method (PAM), shrunken centroids regularized discriminant analysis (SCRDA) and the multiple testing procedure (MTP) were used for feature gene selection. The classification methods included k-nearest-neighbor classifiers (KNN), linear discriminant analysis (LDA), SCRDA, PAM, C-classification support vector machine (C-SVM), shrinkage linear discriminant analysis (SLDA), shrinkage diagonal discriminant analysis (SDDA) and back-propagation artificial neural network (BP-ANN). The five publicly available cancer microarray datasets were (1) MPM & ADCA, (2) colon, (3) multi-class lung cancer, (4) multi-class childhood cancer and (5) multi-class brain cancer. The performance of the above-mentioned methods for significant gene selection was also studied.
A well-known public dataset, the malignant pleural mesothelioma and lung cancer gene expression database, was used in this study. Information about genes associated with lung adenocarcinoma was retrieved from the journal Cancer Research. The locations and expression levels of these genes in the database were obtained, and differential expression was analyzed by multiple t tests. Significant genes were retained, a feature (gene) set was constructed by combining them with the genes selected by the PAM or RDA gene selection method, and this feature (gene) set was then used for the subsequent discriminant analysis. The methods included k-nearest-neighbor classifiers (KNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), shrunken centroids regularized discriminant analysis (SCRDA), the nearest shrunken centroid method (PAM), partial least squares (PLS), generalized partial least squares (GPLS), principal component regression (PCR), ridge regression (RR), C-classification support vector machine (C-SVM), shrinkage linear discriminant analysis (SLDA), shrinkage diagonal discriminant analysis (SDDA) and back-propagation artificial neural network (BP-ANN).
To take advantage of accumulating gene functional annotations, we proposed incorporating known gene functions into a new distance metric, which equals the sum of the measured (expression-based) distance and a biological distance. A two-step procedure was used. First, the shrinkage distance metric was used in any distance-based clustering method, e.g. K-medoids or hierarchical clustering, to cluster the genes with known functions.
Second, while keeping the clustering results from the first step for the genes with known functions, the expression-based distance metric was used to cluster the remaining genes of unknown function, assigning each of them either to one of the clusters obtained in the first step or to a new cluster. The above procedures were performed with the software R 2.8.0 (R Foundation for Statistical Computing, Vienna, Austria).

Results
The conventional method, LDA, could not be applied when the number of genes exceeded the sample size. SCRDA used many more genes than PAM for all cancer datasets. When the performance of the classifiers was compared on two-class and multi-class diagnosis problems, SDA, SCRDA and PAM all had better classification accuracy and stability than LDA. SVM achieved higher accuracy than BP-ANN. The performance of KNN declined noticeably when the use of feature (gene) selection was compared with the use of all genes. Compared with conventional methods, the new method improved performance to some degree, except in a few special cases. In most cases the average accuracy of the new method was higher than that of conventional methods in both the training and test sets, while its standard deviations were usually smaller. A simulation study and an application to gene function prediction in yeast demonstrated the advantage of our proposal over the standard method.

Conclusions
Variable selection did affect the performance of the classifiers, especially KNN. There were obvious differences in gene selection among PAM, SCRDA and MTP: PAM selected fewer genes than SCRDA, and SCRDA selected fewer genes than MTP. Regularized discriminant methods, especially SLDA, were superior to conventional LDA. Given the same genes, the performance of PAM, SCRDA and SDA did not differ. SVM performed better than BP-ANN in some circumstances, although more attention should be paid to the selection of the kernel and parameters. The method that incorporated prior knowledge into discriminant analysis could effectively improve the classification capacity and reduce the impact of noise. This idea is promising not only in practice but also as methodology. The accuracy and interpretive validity of cluster results for gene expression profiling could be improved by incorporating prior knowledge, an approach with good prospects for application.
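The two-step clustering procedure described in the Methods can be sketched roughly as follows. The abstract fixes only the outline (a distance equal to the expression-based distance plus a biological distance, clustering of annotated genes first, then assignment of unannotated genes to existing or new clusters). Everything else here is an assumption made for illustration: the Euclidean expression distance, the 0/1 shared-annotation distance, average-linkage merging with a `cutoff`, the greedy assignment in step two, and all function and gene names.

```python
from math import dist  # Euclidean distance, Python 3.8+


def expression_distance(x, y):
    """Expression-based distance between two profiles (assumed Euclidean)."""
    return dist(x, y)


def biological_distance(fa, fb):
    """Assumed 0/1 distance: 0 if two genes share an annotation, else 1."""
    return 0.0 if fa & fb else 1.0


def combined_distance(g, h, expr, annot):
    """Sum of the expression-based distance and the biological distance."""
    return (expression_distance(expr[g], expr[h])
            + biological_distance(annot[g], annot[h]))


def average_linkage(clusters, d, cutoff):
    """Merge the closest pair of clusters until none is within the cutoff."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                avg = sum(d(a, b) for a, b in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def two_step_clustering(expr, annot, cutoff):
    known = [g for g in expr if g in annot]
    unknown = [g for g in expr if g not in annot]

    # Step 1: cluster annotated genes with the combined distance.
    clusters = average_linkage(
        [[g] for g in known],
        lambda a, b: combined_distance(a, b, expr, annot),
        cutoff,
    )

    # Step 2: keep those clusters and place each unannotated gene, using
    # expression distance only, in the nearest cluster or in a new one.
    for g in unknown:
        scores = [sum(expression_distance(expr[g], expr[m]) for m in c) / len(c)
                  for c in clusters]
        nearest = min(range(len(scores)), key=scores.__getitem__) if scores else None
        if nearest is not None and scores[nearest] <= cutoff:
            clusters[nearest].append(g)
        else:
            clusters.append([g])
    return clusters


# Toy data (hypothetical): three annotated genes and two unannotated ones.
expr = {
    "g1": [1.0, 1.0, 0.0],
    "g2": [1.1, 0.9, 0.0],
    "g3": [0.0, 0.0, 2.0],
    "g4": [1.0, 1.1, 0.1],
    "g5": [5.0, 5.0, 5.0],
}
annot = {"g1": {"ribosome"}, "g2": {"ribosome"}, "g3": {"transport"}}

result = two_step_clustering(expr, annot, cutoff=1.0)
print(result)
```

In this toy run the unannotated gene `g4` joins the cluster of `g1` and `g2`, whose profiles it resembles, while `g5` is far from every existing cluster and starts a new one. This mirrors the "existing cluster or new cluster" rule in the second step.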
Keywords/Search Tags:Prior knowledge, Discriminant analysis, Microarray data, Tumor classification, Cluster analysis