Development And Application Of Data Analysis Algorithms In The Studies Of Complex Diseases

Posted on: 2017-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Shen
GTID: 1360330590990943
Subject: Biomedical engineering
Abstract/Summary:
Genetic factors play a major role in the development of complex diseases. The etiology of complex disease is thought to be multifactorial, with many susceptibility genes interacting with each other or with several environmental factors. However, our understanding of the causes of many complex diseases is limited, and a large proportion of genetic risk factors remain to be explored. Genetic studies of complex diseases help us understand pathogenesis and thus provide important evidence for early-stage prevention and diagnosis, as well as for drug discovery. The genetic study of complex diseases is therefore of great significance. Currently, the general steps of such studies are: (1) perform high-throughput experiments to obtain SNP genotypes; (2) perform quality control and population structure analysis; (3) conduct single-locus association analysis; (4) conduct gene interaction analysis; and (5) construct the molecular network. The results could provide vital information for the genetic diagnosis and prevention of diseases. This information, together with the application of next-generation sequencing, will guide clinical decisions and ultimately contribute to personalized medicine.

The first three studies of this thesis are based on steps (2), (3), and (4) above. In the fourth study, a new algorithm for non-invasive prenatal diagnosis by next-generation sequencing is proposed.

Study I: Population stratification refers to the presence of a systematic difference in allele frequencies between populations, possibly due to different ancestry. It is a problem in genetic association studies because it is likely to highlight loci that underlie the population structure rather than disease-related loci. Principal component analysis (PCA) has been proven to be an effective way to correct for population stratification. However, the conventional PCA algorithm is time-consuming on large datasets. We therefore developed SHEsisPCA, a highly parallel PCA software based on graphics processing units (GPUs), with a peak speedup of more than 100-fold over its CPU version. A clustering algorithm based on X-means was also implemented to detect population subgroups and to obtain matched cases and controls, in order to reduce genomic inflation and increase power. We used SHEsisPCA to perform population structure analysis on an African population and found that the cluster assignments formed by the first two principal components clearly correlated with specific ethnic groups. A study of both simulated and real datasets showed that SHEsisPCA ran at an extremely high speed with hardly any loss of accuracy. SHEsisPCA can therefore help correct for population stratification much more efficiently than conventional CPU-based algorithms.

Study II: Association analysis is one of the important methods in genetic studies. Algorithms and software for genetic analysis of diploid organisms and bi-allelic markers are well established. However, polyploidy is common in plants, and multi-allelic markers such as microsatellites and copy number polymorphisms (CNPs) are also frequently used by researchers. Here, we present SHEsisPlus, an online algorithm toolset for dichotomous and quantitative trait genetic analysis on multi-allelic (?2) markers of polyploid species. It is free, open source, and user-friendly, and it performs a range of analyses, including haplotype inference, linkage disequilibrium analysis, epistasis detection, Hardy-Weinberg equilibrium tests, and single-locus association tests. We also developed an accurate and efficient haplotype inference algorithm for polyploids and proposed an entropy-based algorithm to detect epistasis in the context of quantitative traits. A study of both simulated and real datasets showed that our haplotype inference algorithm was much faster and more accurate than existing ones. Our epistasis detection algorithm was the first attempt to apply information theory to characterizing gene interactions in quantitative trait datasets. Results showed that its statistical power was significantly higher than that of conventional approaches and that it was not affected by single-locus marginal effects. SHEsisPlus is the first online platform for association studies on polyploid and multi-allelic organisms.

Study III: Prostate cancer is one of the most common carcinomas among adult males. Recently, genome-wide association studies (GWAS) have identified several susceptibility genes for prostate cancer. However, these single-locus results can explain only about 13% of the genetic etiology. To understand how multiple genetic variants may contribute to the penetrance of prostate cancer, we conducted a genome-wide gene-gene interaction study in four populations (African American, European, Latino American, Japanese), involving 5,269 cases and 5,289 controls in total. We exhaustively evaluated all pairwise SNP-SNP interactions among the 661,658 SNPs shared by all four groups, and then performed a meta-analysis to combine the results. We found that multiple variants within regions 7p21.3 and 18p11.22 interact significantly with each other, reaching a stringent genome-wide significance level (2.28 × 10^-13). The most significant epistasis was detected between rs1105255 (intergenic, near RBSG3) and rs651431 (intergenic, near VAPA) (p = 1.4 × 10^-14). Notably, VAPA has been identified as a protein-coding transcript that acts as a competing endogenous RNA (ceRNA) of PTEN in prostate cancer, and PTEN is a critical tumor suppressor gene that is frequently altered in human cancers. Previous studies have identified multiple susceptibility loci in region 7p21. Multiple regulatory elements were also found within 7p21.3 and 18p11.22, indicating that the variants might regulate nearby genes (VAPA, RBSG3, etc.) and confer disease risk. We also found several other significant epistasis pairs, most of which were near or in cancer-related genes. Drug target enrichment analysis suggested that genes in the top epistasis pairs overlapped significantly with target genes of FDA-approved drugs for the treatment of prostate cancer. Previous studies have shown that the results of genome-wide single-locus association studies can provide valuable information for drug discovery. Here we demonstrated that the results of a genome-wide gene interaction study can offer such information as well. This indicates that human genetic data can be efficiently integrated with other biological information to derive biological insights and drive drug discovery.

Study IV: Noninvasive prenatal detection of fetal chromosomal aneuploidies (such as fetal trisomy 21, trisomy 18, and trisomy 13) by high-throughput next-generation sequencing has proven to be accurate and sensitive. Currently, most data analysis methods for this application involve a Z-score test, which is based on a reference distribution built from at least dozens of normal samples. This is both costly and time-consuming. Moreover, because experimental conditions (humidity, temperature, air pH, machine status, personal error, etc.) vary between runs, noise cannot be eliminated and will skew the results. To overcome these drawbacks, we proposed a new analytical strategy based on multiplex barcoded sequencing of both normal and unknown samples in a single run on the Ion Torrent PGM. This method requires only one normal sample. Applying it to 13 single runs with a total of 44 samples, we achieved a sensitivity and specificity of 100% and 95.181% for T13 detection, 100% and 100% for T18 detection, and 90% and 100% for T21 detection, respectively.
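The abstract does not say how the genome-wide significance level in Study III was derived. One reading that is consistent with the reported number is a Bonferroni correction over all SNP pairs at a family-wise error rate of 0.05; both the correction method and the 0.05 level are assumptions here, not statements from the thesis. A minimal Python sketch of that arithmetic (the function name is hypothetical):

```python
from math import comb


def pairwise_bonferroni_threshold(n_snps: int, alpha: float = 0.05) -> float:
    """Bonferroni-corrected p-value threshold for testing every SNP pair.

    Assumption: alpha = 0.05 and a plain Bonferroni correction; the
    abstract only reports the resulting threshold.
    """
    n_tests = comb(n_snps, 2)  # number of unordered SNP pairs
    return alpha / n_tests


# 661,658 SNPs is the figure given in the abstract.
threshold = pairwise_bonferroni_threshold(661_658)
print(f"{threshold:.3g}")  # prints 2.28e-13

# The top reported interaction (p = 1.4e-14) falls below this threshold.
print(1.4e-14 < threshold)  # prints True
```

With 661,658 SNPs there are 218,895,323,653 unordered pairs, so 0.05 divided by that count gives roughly 2.28 × 10^-13, matching the level quoted in the abstract.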
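For context on Study IV, the conventional Z-score test that the abstract contrasts with can be sketched as follows. This illustrates the reference-based approach only, not the single-run method proposed in the thesis. The read fractions are invented for illustration, and the cutoff of 3 is a commonly used convention assumed here, not a value taken from the abstract.

```python
from statistics import mean, stdev


def chromosome_z_score(test_fraction: float, reference_fractions: list[float]) -> float:
    """Z-score of a sample's chromosome read fraction against normal references.

    Needs at least two reference samples so that a standard deviation exists.
    """
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)  # sample standard deviation
    return (test_fraction - mu) / sigma


# Hypothetical fractions of reads mapping to chromosome 21 in normal samples.
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]

# Hypothetical test sample with an elevated chromosome 21 fraction.
z = chromosome_z_score(0.0140, reference)

# Assumed decision rule: flag the sample when z exceeds 3.
is_flagged = z > 3.0
print(round(z, 2), is_flagged)
```

The sketch shows why the conventional test depends on a sizeable reference set: both the mean and the standard deviation come from normal samples, so a small or run-specific reference set makes the score unstable.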
Keywords/Search Tags: complex disease, single nucleotide polymorphism, population stratification, haplotype inference, epistasis, non-invasive prenatal diagnosis