Statistical approaches for genome-wide association study and microarray analysis

Posted on: 2009-02-25 | Degree: Ph.D | Type: Dissertation
University: Michigan Technological University | Candidate: Qin, Huaizhen
GTID: 1443390002999698 | Subject: Statistics

Abstract/Summary:

For pedigree-based genome-wide association studies, we propose in Chapter 1 a data-driven weighting scheme that accommodates pedigrees of any fixed size. The scheme remarkably outperforms the optimal top R and optimal exponential weighting approaches because it combines their advantages with data-driven weighting and uses all available genotypic and phenotypic information. The association information conveyed by the children is partitioned into between- and within-pedigree components. In the screening stage, an accurate relative ranking of all markers is created using both the between-pedigree component and the founder information. In the testing stage, all markers are tested group-wise using the weighted within-pedigree component, which is scaled by the marker ranking from the screening stage. This scheme controls the family-wise error rate at the desired level regardless of linkage disequilibrium structure and population stratification, and its power is robust to population stratification up to a reasonable level.

In genome-wide association studies, population-based and family-based designs have largely been addressed separately, with notable efforts devoted to two-stage schemes. The central idea of the vast majority of existing two-stage schemes is to formally test only the R most promising markers, which are carefully selected from all genotyped markers. However, determining the optimal R is intractable, and this difficulty may limit the usefulness of top R approaches. In Chapter 2, we propose an informative weighting scheme that efficiently utilizes the available natural population and family resources and ensures that all SNPs are formally tested.
Analytically, we prove that the new approach rigorously controls the family-wise error rate at a desired nominal level. Empirically, the new scheme proves dramatically more powerful than prevailing one-stage and two-stage approaches, such as the standard pedigree disequilibrium test, the population-based score test, the optimal exponential weighting scheme, and the optimal top R approach.

In genome-wide association studies, some causal variants may be completely untyped because only the tag single nucleotide polymorphisms (SNPs) are genotyped. A remedy for uncovering these hidden causal loci is to impute their variants accurately and rapidly. We propose an efficient localized expectation-maximization (LEM) algorithm that imputes the genotypes at untyped marker loci in a new study by utilizing available comprehensive reference catalogs and incorporating multi-locus linkage disequilibrium. The new approach significantly outperforms the hidden Markov model (HMM) based approach in terms of imputation accuracy and computational efficiency. Applications to 22 chromosomes illustrate the practical advantages of our approach: compared with the standard method, it improves imputation accuracy by 7% and doubles computational efficiency.

Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene expression data can be stratified and clustered into relatively homogeneous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis, and information pooling.
Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates, and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identifying differentially expressed genes.

Correlation-based methods such as tight clustering have recently attracted interest in microarray analysis. Available methods are based mainly on intuitive ideas, empirical experiments, and the classical theory of sample correlation. Gene-to-gene correlation, however, has special properties under distinct scenarios, and a pertinent theoretical basis and analytical treatment of it have been lacking. Chapter 5 fills this gap for two-condition experiments. We obtain generic stochastic representations and asymptotic distributions, with convergence rates, of gene-to-gene correlation as functions of differential magnitudes, residual correlation, and experiment size. Numerically, we illustrate the tail behavior of intra-gene correlation. The results may serve as the theoretical basis for the forward search tight clustering described in Chapter 4.

Keywords/Search Tags: Genome-wide association, Microarray, Chapter, Approaches, Tight clustering, Correlation, Weighting, Scheme
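
As a rough illustration of why gene-to-gene correlation in a two-condition experiment depends on differential magnitudes and residual correlation, the sketch below computes the population correlation of two genes when samples from both conditions are pooled. It is a textbook mixture calculation under assumptions that are not taken from the dissertation: unit residual variance for each gene, a common residual correlation `rho` within each condition, a mean shift `delta1` or `delta2` in the second condition, and a fraction `frac` of samples in the second condition. The function name and parameters are hypothetical, and this is not the stochastic representation derived in Chapter 5.

```python
import math


def pooled_correlation(rho, delta1, delta2, frac=0.5):
    """Population correlation of two genes after pooling both conditions.

    Illustrative assumptions (not from the dissertation):
      - each gene has unit residual variance within a condition,
      - rho is the residual correlation within a condition,
      - delta1, delta2 are the mean shifts of the two genes in condition 2,
      - frac is the fraction of samples that belong to condition 2.
    """
    # Variance of the 0/1 condition indicator.
    mix = frac * (1.0 - frac)
    # Shared condition effect adds to the residual covariance.
    cov = rho + mix * delta1 * delta2
    var1 = 1.0 + mix * delta1 ** 2
    var2 = 1.0 + mix * delta2 ** 2
    return cov / math.sqrt(var1 * var2)


if __name__ == "__main__":
    # Independent residuals, both genes shifted in the same direction.
    print(pooled_correlation(0.0, 2.0, 2.0))
    # Independent residuals, genes shifted in opposite directions.
    print(pooled_correlation(0.0, 2.0, -2.0))
    # No differential expression: pooled correlation equals rho.
    print(pooled_correlation(0.3, 0.0, 0.0))
```

Under these assumptions, two genes with independent residuals still appear correlated once both are differentially expressed, which is one reason correlation-based clustering of pooled two-condition data needs a dedicated theoretical treatment.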