Association Analysis Based on Next-Generation Sequencing Technology and Population Structure Theory

Posted on: 2012-07-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K C Xiao
Full Text: PDF
GTID: 1110330371965449
Subject: Bioinformatics
Abstract/Summary:
Next-generation sequencing technologies can detect the entire spectrum of genomic variation and provide a powerful tool for systematically exploring common, low frequency and rare variants across the genome. The 1000 Genomes Project (1000G) is one such endeavor: it aims to characterize the pattern of human genetic variation down to the MAF = 1% level as a foundation for association studies, and it provides data on SNPs, INDELs and CNVs. However, the current paradigm for genome-wide association studies (GWAS) is to catalogue and genotype common variants (MAF > 5%). Methods and study designs for testing association of low frequency (0.5% < MAF ≤ 5%) and rare (MAF ≤ 0.5%) variants have not been thoroughly investigated. Here, we explored different strategies and study designs for GWAS in the near future, based on both the 1000 Genomes low coverage pilot data and the exon pilot data. We investigated the linkage disequilibrium (LD) pattern among common and low frequency SNPs and its implications for association studies. We found that LD between pairs of low frequency alleles, and between low frequency and common alleles, is much weaker than LD between pairs of common alleles. We examined various tagging designs, with and without statistical imputation, and compared their power against de novo resequencing in mapping causal variants under various disease models. We used the low coverage pilot data, which contain ~14M SNPs, as a hypothetical genotype-array platform (Pilot 14M) to assess the impact of such an array on tag SNP selection, mapping coverage and the power of association tests. Even after imputation, 45.4% of low frequency SNPs remained untaggable, and only 67.7% of low frequency variation was covered by the Pilot 14M array.
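To make the LD comparison concrete, the sketch below computes pairwise LD for SNPs in different frequency classes. It assumes r² as the LD measure (the abstract does not say which measure was used) and phased 0/1 haplotypes; the function names and the toy haplotypes are hypothetical. The frequency bins follow the thresholds quoted above.

```python
import numpy as np

def maf(haplotypes):
    """Minor allele frequency per SNP; haplotypes is (n_haplotypes, n_snps) of 0/1."""
    p = haplotypes.mean(axis=0)
    return np.minimum(p, 1.0 - p)

def frequency_class(m):
    """Bin a MAF value: common (> 5%), low (0.5% < MAF <= 5%), rare (<= 0.5%)."""
    if m > 0.05:
        return "common"
    if m > 0.005:
        return "low"
    return "rare"

def r_squared(a, b):
    """LD r^2 between two 0/1 allele vectors observed on the same haplotypes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pa, pb = a.mean(), b.mean()
    d = (a * b).mean() - pa * pb              # D = p_AB - p_A * p_B
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else d * d / denom

# Toy data (made up): 200 haplotypes, two SNPs.
n = 200
common = np.zeros(n)
common[:100] = 1        # allele frequency 0.5
low = np.zeros(n)
low[:2] = 1             # allele frequency 0.01, carried only on `common` haplotypes

H = np.column_stack([common, low])
classes = [frequency_class(m) for m in maf(H)]

r2_same = r_squared(common, common)   # a SNP against itself
r2_mismatch = r_squared(common, low)  # common vs. low frequency SNP
```

In this toy case the low frequency allele sits entirely on haplotypes that carry the common allele, yet r² is only about 0.01, because r² is bounded by how closely the two allele frequencies match. Under the r² assumption, this is one reason common tag SNPs capture low frequency variants poorly.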
This suggests that GWAS based on SNP arrays would be ill-suited for association studies of low frequency variation.
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines population structure, is much lower. This motivated us to use locally linear embedding (LLE), which projects high dimensional genomic data into a low dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate the application of LLE to population genetic analysis, we systematically investigated several important properties of LLE and revealed its connection to principal component analysis (PCA). Identifying a set of markers and genomic regions that can be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying LLE-correlated and PCA-correlated structure-informative markers, we developed a new statistic that integrates the information content of a genomic region in order to study its collective association with population structure, together with a LASSO algorithm to search for such regions across the genome. We applied these methods to a low coverage pilot dataset from the 1000 Genomes Project and to the HapMap Phase III Mexico dataset. Our results demonstrated that LLE outperforms PCA for population structure analysis. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This shows that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants.
These preliminary results demonstrate that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis.
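As a rough illustration of the two embeddings compared above, the following sketch implements PCA and a textbook form of LLE with NumPy and applies both to a simulated genotype matrix. It is not the thesis implementation: the function names, the regularization constant, the neighborhood size k, and the simulated admixture gradient are all assumptions made for illustration.

```python
import numpy as np

def pca_embed(X, d=2):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d] * S[:d]

def lle_weights(X, k=8, reg=1e-3):
    """Reconstruction weights: each row rebuilds one sample from its k nearest neighbors."""
    n = X.shape[0]
    gram = X @ X.T
    norms = np.diag(gram)
    sq = norms[:, None] + norms[None, :] - 2.0 * gram   # squared Euclidean distances
    np.fill_diagonal(sq, np.inf)                        # a sample is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[:k]
        Z = X[nbrs] - X[i]                              # neighbors centered on sample i
        G = Z @ Z.T                                     # local Gram matrix (k x k)
        G += np.eye(k) * reg * max(np.trace(G), 1e-12)  # regularize (needed when k > rank)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                        # weights sum to one
    return W

def lle_embed(X, k=8, d=2, reg=1e-3):
    """Embed via the bottom eigenvectors of (I - W)^T (I - W)."""
    W = lle_weights(X, k, reg)
    I = np.eye(X.shape[0])
    M = (I - W).T @ (I - W)
    _, vecs = np.linalg.eigh(M)                         # eigenvalues in ascending order
    # Skip the first eigenvector, which is constant when the neighbor graph is connected.
    return vecs[:, 1:d + 1]

# Simulated genotypes (made up): 40 individuals along an admixture gradient t,
# 300 SNPs coded as 0/1/2 copies of an allele.
rng = np.random.default_rng(0)
n_ind, n_snp = 40, 300
t = np.linspace(0.0, 1.0, n_ind)
p_a = rng.uniform(0.1, 0.9, n_snp)                      # allele frequencies, source A
p_b = rng.uniform(0.1, 0.9, n_snp)                      # allele frequencies, source B
freq = np.outer(1.0 - t, p_a) + np.outer(t, p_b)
genotypes = rng.binomial(2, freq).astype(float)

Y_pca = pca_embed(genotypes)
Y_lle = lle_embed(genotypes)
```

Plotting the columns of `Y_pca` and `Y_lle` against `t` would show how each method recovers the simulated gradient. Which method separates real populations better depends on the data and on the choice of k, so this sketch does not by itself reproduce the comparison reported above.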
Keywords/Search Tags: Whole Genome Sequencing, Association Analysis, Linkage Disequilibrium, Population Structure, Locally Linear Embedding, Principal Component Analysis, Dimensionality Reduction, LASSO