Design And Implementation Of Genome-wide Association Study System Based On Hadoop

Posted on:2013-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Wang

Full Text:PDF

GTID:2254330392470588

Subject:Computer technology

Abstract/Summary:

With the release of the fine-scale map of the human genome, genome-wide association studies (GWAS) have developed rapidly and become an important approach for detecting the genetic factors of complex diseases. Because imputation increases the density of single nucleotide polymorphisms (SNPs) in the study data and increases the power of GWAS to find disease-causing variants, imputation-based GWAS has been widely used. However, this method currently faces two problems in practical applications. One is the lack of integrated system tools that run the entire GWAS pipeline from data processing to analysis. The other is that current tools for genotype imputation and association testing cannot cope effectively with the growth in data volume and computation caused by the growth of the reference data.

Building on research into imputation-based GWAS and Hadoop, in this paper we developed a Hadoop-based GWAS system called CloudAssoc. CloudAssoc consists of three modules: data preparation, genotype imputation, and association testing. The data preparation module performs data conversion and quality control. The genotype imputation module is implemented on the Hadoop platform and predicts untyped SNPs using public data. The association test module is also based on Hadoop and performs single-SNP association tests on the imputed data.

The key reason CloudAssoc improves GWAS efficiency is the parallelization of genotype imputation and association testing. Based on a study of the model and algorithm that IMPUTE2 uses, we split the position interval to be analyzed into smaller intervals, which turns one large task with high time and resource consumption into small tasks that are distributed across a Hadoop cluster. By executing these tasks with Hadoop Streaming, we implemented the parallelization of the imputation module. The association test module is parallelized in the same manner.

Finally, the system was tested. First, for the parallelization software of CloudAssoc, the scalability, the efficiency, and the relationship between running time and the size of the split data interval were tested. The results show that the parallelization software achieves near-linear speedup with good scalability and efficiency. Then CloudAssoc as a whole was tested, and the result shows that the system can efficiently complete a genome-wide, imputation-based GWAS pipeline.
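The interval-splitting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the CloudAssoc implementation: the abstract does not give the chunk size, the task format, or the tool invocation, so the function names `split_interval` and `task_lines`, the tab-separated record layout, and the chunk size of 1,000,000 positions are all assumptions made for illustration. Each emitted line stands for one small task that a Hadoop Streaming mapper could read from standard input.

```python
from typing import Iterable, Iterator, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions; a hypothetical convention


def split_interval(start: int, end: int, chunk_size: int) -> List[Interval]:
    """Split the inclusive range [start, end] into consecutive inclusive chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    chunks: List[Interval] = []
    lo = start
    while lo <= end:
        # The last chunk may be shorter than chunk_size.
        hi = min(lo + chunk_size - 1, end)
        chunks.append((lo, hi))
        lo = hi + 1
    return chunks


def task_lines(chrom: str, chunks: Iterable[Interval]) -> Iterator[str]:
    """Format each chunk as one tab-separated task record (assumed layout)."""
    for lo, hi in chunks:
        yield f"{chrom}\t{lo}\t{hi}"


if __name__ == "__main__":
    # Assumed example values: one chromosome region cut into 1,000,000-position tasks.
    for line in task_lines("chr20", split_interval(1, 5_000_000, 1_000_000)):
        print(line)
```

Under these assumptions, a file of such records would serve as the streaming job input, with each mapper running the imputation or association tool on the interval named in its record. The abstract reports testing how running time relates to the size of the split interval, so the chunk size is kept as a parameter here.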

Keywords/Search Tags:

Genome-wide Association Study, Hadoop, Imputation, SNPs Association Test, Parallelization

Related items

1	Genotype Imputation For Genome-wide Association Data Identify Novel Loci Associated With SLE
2	Genome-wide Association Study And HLA Region Fine Mapping Study Of Syphilis
3	Two-stage Design And Analysis For Genome-wide Association Studies
4	Genome-wide Association Studies Of Specific Anti-nuclear Autoantibody Sub-phenotypes In Primary Biliary Cholangitis
5	Identification Of The Genetic Association Between Ischemic Stroke And Parkinson’s Disease Using Genome-wide Association Study
6	Genome-wide Association Study Of Body Composition Index And Association Of SOST Gene Polymorphisms With Peak BMD
7	Genome-Wide Association Of HBV-Associated Liver Diseases
8	Genome-wide Association Study Of Cerebrospinal Neurofilament Light Levels In Non-demented Elders
9	Multi-step Correlation Test Based On The Genome Of Any Family Structure Data
10	Genome-Wide Association Study Of Prognosis And Toxicity In Non-Small Cell Lung Cancer Patients Receiving Platinum-based Chemotherapy