Gene- or Region-Based Statistics For Genome Wide Association Study Via Dimension Reduction Techniques

Posted on: 2012-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Gao
Full Text: PDF
GTID: 2214330338464104
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Many common human diseases, such as cancer, schizophrenia, essential hypertension, type 2 diabetes, and cardiovascular disease, are known to be complex diseases. Complex diseases, also known as multifactorial diseases, are controlled by multiple genetic and environmental factors. Although they often show a tendency toward familial aggregation, complex diseases do not have a clear-cut pattern of inheritance, which makes it difficult to determine a person's risk of inheriting or passing on these disorders. Recently, with rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS), which genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of participants, have emerged as promising approaches for identifying SNPs that are marginally associated with complex diseases. In addition, research on gene-gene interactions (epistasis) in GWAS has shed some light on disease-associated pathways and networks and, despite the computational challenge, has improved our understanding of the genetic basis of complex diseases.

However, many analytic and interpretation challenges remain in GWAS. It is customary to run SNP-based association or interaction tests across the whole genome to identify causal or associated SNPs with strong marginal or joint epistatic effects on diseases or traits. In other words, the unit of association is the SNP. Such an SNP-based analysis usually leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and a reduced ability to detect modest effects. In the present study, higher-level units, such as genes or genome regions, were considered in order to deal with these and related challenges.
Under this framework, we proposed four methods to detect disease-associated genes or gene-gene interactions in the genome, presented in the following four chapters.

Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
Once SNPs have been allocated to genes or regions, the question remains of how to evaluate the genetic association for each candidate gene or genome region. As powerful multi-marker analysis methods, PCA-based methods are often applied in gene- or region-based association studies. PCA can capture linkage disequilibrium information and avoid multicollinearity between SNPs within a candidate gene or region. However, it only extracts linear relationships between SNPs. In nonlinear situations, PCA-based methods lose power, and a nonlinear PCA model should be used. Therefore, in the present study, we introduced a nonlinearity measure to determine whether the underlying relationship within a given variable set can be described by a linear PCA model or whether a nonlinear PCA model must be used for further study. Applications to two simulated data sets and to data from GAW16 are described to demonstrate its performance. In the two simulated examples, as expected, no violations of the accuracy bounds arise in the linear example, while some of the residual variances fall outside the accuracy bounds in the nonlinear example. For the real data, at least one of the residual variances falls outside the accuracy bounds, implying that a nonlinear PCA model is required for this data set. These results show that the new nonlinearity measure is effective in detecting the type of relationship between variables in a given data set.
With this measure, we can choose a more suitable model to make optimal use of all the information available in the given data set.

Chapter 2 Gene- or region- based association study via kernel principal component analysis
For linear data, PCA-based methods are the better choice for the subsequent association study, while nonlinear approaches should be applied to nonlinear data. Among the modified nonlinear PCA methods, kernel PCA (KPCA) is the best known and most widely adopted. In this study, we proposed combining KPCA with the logistic regression test (LRT) to detect the association between multiple SNPs in a candidate gene or genome region and diseases or traits. The algorithm first conducts KPCA to account for between-SNP relationships in a candidate region, and then applies the LRT to test the association between the kernel principal component (KPC) scores and the disease. Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with the logistic regression test (PCA-LRT) across different sample sizes, significance levels, and relative risks, especially at the gene-wide level (1E-5) and at lower relative risks (RR=1.2, 1.3). Application to the four regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.

Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
The gene- or region-based approaches mentioned above, including our newly proposed KPCA-based method, will improve our understanding of the genetic basis of complex diseases. However, all of these approaches can only handle a gene or genome region containing several to tens of markers.
For a large number of SNPs across a candidate region or the human genome, the performance of these methods will not be satisfactory. In recent years, sliding-window methods, in which several neighboring SNPs are included together in a "window", have become a popular strategy for automated GWAS data analysis. In these approaches, the candidate region or the whole genome is divided into many contiguous overlapping windows, and a gene- or region-based multi-locus association method is then applied in each window. A sliding-window approach can be implemented with a fixed window size or with variable sizes. However, it is not certain whether window sizes set in advance or decided by specific methods are statistically sufficient to achieve optimal detection power. Lin et al. proposed that an exhaustive search of all possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient to detect common or rare genetic-risk alleles. With the development and extensive application of multiprocessor and multithreading computational techniques, such "exhaustive" methods have become more feasible in practice. In the present study, under the framework of "exhaustive" search, we first conducted simulations to assess statistical power with different window sizes, and then evaluated performance on real data to test whether the exhaustive strategy can be extended to GWAS data analysis. Results from both the simulations and the real data analysis indicated that the powers and p-values obtained with different window sizes were quite different. Furthermore, the proposed exhaustive strategy, combined with cluster computing, is computationally efficient and feasible for analyzing GWAS data, so it should be more widely adopted in GWAS data analysis.
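The exhaustive window search described in Chapter 3 can be illustrated with a minimal sketch. It is not the thesis implementation: the names `all_windows`, `exhaustive_scan`, and `window_test` are hypothetical, and `window_test` merely stands in for whatever per-window statistic is used (for example, a PCA-based logistic regression test) and is assumed to return a p-value.

```python
from typing import Callable, Iterator, List, Tuple

# A window is (start index, exclusive end index) over SNPs in genomic order.
Window = Tuple[int, int]


def all_windows(n_snps: int, max_size: int) -> Iterator[Window]:
    """Yield every contiguous window containing 1..max_size SNPs."""
    for size in range(1, min(max_size, n_snps) + 1):
        for start in range(n_snps - size + 1):
            yield (start, start + size)


def exhaustive_scan(
    n_snps: int,
    max_size: int,
    window_test: Callable[[Window], float],
) -> List[Tuple[Window, float]]:
    """Apply window_test (hypothetical; assumed to return a p-value)
    to every window and return the results sorted by ascending p-value."""
    results = [(w, window_test(w)) for w in all_windows(n_snps, max_size)]
    return sorted(results, key=lambda item: item[1])
```

Because each window is tested independently of the others, the loop in `exhaustive_scan` could be split across processes or cluster nodes, which is one reason an exhaustive search can remain practical on multiprocessor hardware.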
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
For GWAS data sets, it is often of interest to identify SNPs that jointly have an epistatic (interaction) effect on complex diseases. However, most current methods treat the SNP as the unit of association, which leads to several well-known limitations such as multiple testing. Under the gene- or region-based framework, our group has previously proposed a gene-based statistic (the CCU statistic) for detecting gene-gene co-association based on canonical correlation analysis (CCA). When the two genes of interest are unlinked, the co-association between them is equivalent to their interaction effect. The CCU statistic has been shown to perform well in detecting gene-gene co-associations or interactions. However, CCA can only detect linear structure in a data set. If the genomic data contain nonlinear structure, CCA will not be able to detect it. In recent years, kernel CCA (KCCA), a generalization of CCA, has been studied intensively in the fields of machine learning, face recognition, and data classification, and has been reported to succeed in many applications. We therefore proposed using KCCA rather than CCA to construct a revised version of the CCU statistic, the kernel CCU (KCCU) statistic, for detecting gene-gene interactions in association studies. Simulation results showed that the power of the KCCU statistic was higher than that of the CCU statistic at all the significance levels, sample sizes, and relative risks considered. Application to the RA data in GAW16 Problem 1 showed that the CCU statistic only detected the interaction between the PTPN22 and C5 genes, while the KCCU statistic identified all the pairwise interactions among the four genes. In summary, the KCCU statistic performed better than the CCU statistic.
Keywords/Search Tags:nonlinearity measure, gene- or region-based, kernel principal component analysis, exhaustive sliding-window, kernel canonical correlation analysis