
Study On Statistical Methods For Analyzing Gene Expression Microarray Data Under Mixed Linear Model Framework

Posted on: 2007-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zou
GTID: 2120360185960058
Subject: Bioinformatics
Abstract/Summary:
Gene expression microarrays are a new biotechnology with enormous promise, opening a new area of genome research by simultaneously investigating hundreds of thousands of genes in one experiment. The first and essential problem in gene expression microarray data analysis is to identify differentially expressed genes (DEGs) under given conditions. Currently, the commonly used approaches for analyzing microarray data are cluster analysis and supervised grouping, especially for classifying tumor subtypes and different disease states. However, these methods focus only on similarity in the data structure and cannot guarantee that the class predictors used are biologically or statistically associated with the class distinction. Therefore, a DEG identification procedure is still required before cluster analysis to screen for genes that are significantly correlated with the treatment of interest (different tumor types, etc.).

Since microarray experiments involve a series of complex procedures, such as RNA preparation, hybridization, and scanning, raw intensities are generally influenced by systematic experimental variation. Compared with data from conventional biological experiments, microarray data are much more complicated: they contain thousands of inter-related variables (genes), come from relatively small samples (sometimes fewer than 10 arrays) in unbalanced designs, have little or no replication, and include missing values caused by failed spots. These features greatly challenge statistical methods for identifying DEGs.

Based on the mixed linear model framework, we propose a new statistical strategy for analyzing gene expression microarray data that focuses mainly on objectively identifying DEGs.
This method can handle unbalanced microarray data with missing values and is extensible to more complex situations, such as N dyes, multiple factors decomposed from the treatment effect, and additional technical factors. Henderson method III is used to construct an F-statistic for testing the significance of the treatment effects of each gene, and the identified DEGs are ranked by their statistic scores, which gives biologists more information and choice. In addition, estimates of the magnitude of various sources of variation (variance components), such as technical variation, are provided to guide further experimental protocols; predicted effects or estimates of GT of interest are also available for further analysis. The main issues and related results are summarized as follows:

1. We implement our statistical method in three steps. First, microarray data are normalized to eliminate overall variation in the experiments that is not gene specific. Second, a gene-specific model is fitted to identify the DEGs: based on Henderson method III, we construct an F-statistic to measure the expression differences between treatments of interest, and false discovery rate (FDR) adjusted p-values are used for statistical inference to control the experiment-wise type I error rate. Finally, the DEGs preselected in the previous step are combined to fit a multi-gene model. An MCMC algorithm is applied to obtain variance components, predictions of random effects, and estimates of fixed effects, and it also provides the corresponding confidence intervals and significance tests.
2. A series of Monte Carlo simulations was conducted to examine the robustness and efficiency of the present method. Simulation results show that our strategy is capable of discovering significant genes with higher power under different experimental designs and data properties.
3. A computer program written in C++ has been developed to put this gene expression data analysis into practice. The program has been packaged in the newly developed statistical analysis software QTModel. It supports the selection of DEGs ranked by their statistic scores, with data from either the simplest microarray experiments or complex experiments involving more than one experimental variable. In addition, various statistical algorithms are available in the program to estimate variance components, predict random effects, or estimate fixed effects.
4. Two real datasets, leukemia data and mouse brain data, are used as worked examples to illustrate the utility of the proposed method.
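
The FDR adjustment used in the second step can be illustrated with a small sketch. The abstract does not say which FDR procedure the thesis uses, so the Benjamini-Hochberg step-up adjustment below is an assumption, chosen because it is a common way to compute FDR-adjusted p-values. The function name `bh_adjust`, the gene labels, and the p-values are hypothetical and are not taken from the thesis or from QTModel; in practice the p-values would come from the per-gene F-tests.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-value increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values (illustrative only).
p_by_gene = {
    "gene_a": 0.041,
    "gene_b": 0.001,
    "gene_c": 0.60,
    "gene_d": 0.008,
    "gene_e": 0.039,
}

genes = list(p_by_gene)
adj = bh_adjust([p_by_gene[g] for g in genes])

# Keep genes whose adjusted p-value is at or below the chosen FDR level.
fdr_level = 0.05
selected = [g for g, q in zip(genes, adj) if q <= fdr_level]
print(selected)
```

Genes that pass the threshold could then be ranked by their F-statistic scores, as the abstract describes, before being carried into the multi-gene model.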
Keywords/Search Tags: Gene expression microarray, Mixed linear model, Henderson method III, MCMC, Factorial design