| With the release of the fine-scale map of the human genome, genome-wide association studies (GWAS) have developed rapidly and become an important approach for detecting the genetic factors of complex diseases. Because imputation increases the density of single nucleotide polymorphisms (SNPs) in the study data and thereby the power of GWAS to find disease-causing variants, imputation-based GWAS has been widely used. However, two problems arise in practical applications of this method. First, there is a lack of integrated tools that run the entire GWAS pipeline from data processing to analysis. Second, current tools for genotype imputation and association testing cannot cope effectively with the growth in data volume and computation caused by the growth of the reference data. Building on research into imputation-based GWAS and Hadoop, this paper presents CloudAssoc, a Hadoop-based GWAS system. CloudAssoc consists of three modules: data preparation, genotype imputation, and association testing. The data preparation module performs data conversion and quality control. The genotype imputation module is implemented on the Hadoop platform and predicts untyped SNPs using public data. The association test module is also based on Hadoop and performs single-SNP association tests on the imputed data. The key reason CloudAssoc improves GWAS efficiency is the parallelization of genotype imputation and association testing. Based on a study of the model and algorithm used by IMPUTE2, we split the position interval to be analyzed into smaller intervals, so that one large task with high time and resource consumption becomes many small tasks that are distributed across a Hadoop cluster. By executing these tasks with Hadoop Streaming, we implemented the parallelization of the imputation module. The association test module is parallelized in the same manner. At the end of this paper, the system is tested. 
First, the parallelized software of CloudAssoc was tested for scalability, efficiency, and the relationship between running time and the size of the split data interval. The results show that the parallelized software achieves near-linear speedup with good scalability and efficiency. Finally, CloudAssoc as a whole was tested, and the results show that the system can efficiently complete a genome-wide, imputation-based GWAS pipeline. |
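
The interval-splitting idea described in the abstract can be illustrated with a minimal sketch. This is not the CloudAssoc implementation. The function names `split_interval` and `task_lines`, the inclusive-coordinate convention, and the tab-separated task format are all assumptions made for illustration. The sketch only shows how one large position interval might be cut into independent sub-intervals, each of which could then be handed to a separate worker.

```python
from typing import Iterator, List, Tuple


def split_interval(start: int, end: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (lo, hi) sub-intervals that together cover [start, end].

    Hypothetical helper: the coordinate convention is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    lo = start
    while lo <= end:
        # The last chunk may be shorter than chunk_size.
        hi = min(lo + chunk_size - 1, end)
        yield lo, hi
        lo = hi + 1


def task_lines(chrom: str, start: int, end: int, chunk_size: int) -> List[str]:
    """Build one tab-separated task description per chunk.

    Assumed format: chromosome, lower bound, upper bound. A line-oriented
    mapper could read one such line per task.
    """
    return [f"{chrom}\t{lo}\t{hi}" for lo, hi in split_interval(start, end, chunk_size)]


if __name__ == "__main__":
    # Example: split a 20 Mb region into 5 Mb chunks.
    for line in task_lines("20", 1, 20_000_000, 5_000_000):
        print(line)
```

The chunk size is the tuning parameter here: smaller chunks give more tasks to distribute but add per-task overhead, which is the relationship between running time and split interval that the tests in the paper examine.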