
Design And Implementation Of Variant Detection Algorithm Based On Cloud Computing

Posted on: 2019-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Wu
Full Text: PDF
GTID: 2428330566486576
Subject: Computer Science and Technology
Abstract/Summary:
Variant detection is the basis of data analysis for high-throughput sequencing and is widely used in disease research, clinical treatment, and new drug development. However, with the advance of high-throughput sequencing technology and the falling cost of sequencing, traditional single-machine variant detection algorithms can no longer process the current massive volume of genome data in a timely manner. Cloud computing offers an easy-to-use, efficient parallel processing framework and a scalable storage system. It is an excellent distributed solution, but it also places demands on an algorithm's scalability and storage design: only an algorithm that combines good scalability with an appropriate distributed storage layout can fully exploit the computing power of multiple machines.

Building on the current mainstream variant detection algorithm, HaplotypeCaller, this thesis designs and implements the variant detection algorithm Cloud HC using the Spark in-memory computing framework and a distributed storage system. Cloud HC keeps its detection results highly consistent with those of HaplotypeCaller while achieving higher scalability. The main work of this thesis includes:

(1) To address the computation skew of HaplotypeCaller, a parallel strategy of adaptive data segmentation is proposed, and a variant detection algorithm based on adaptive data segmentation (ADS-HC) is designed and implemented (see the first sketch below). Experiments show that this strategy achieves a nearly linear speedup and a shorter elapsed time than Spark's common strategy of partitioning data into equal-sized blocks, in both single-node and multi-node settings.

(2) To meet the requirement that adjacent data blocks have overlapping boundaries, the Hadoop-BAM library is customized to partition BAM files into overlapped blocks (see the second sketch below). Experiments show that overlapped data blocks improve the consistency of the results.

(3) A Kudu storage solution is designed around the observation that, after cache optimization, most known variants are read randomly while the rest are read sequentially (see the third sketch below). Experiments show that, in both single-node and multi-node settings, the optimized Kudu read solution outperforms the HBase and HDFS solutions.

Overall, the result consistency between Cloud HC and GATK 3.8 exceeds 99.9%. In terms of scalability, Cloud HC achieves a speedup of at least 17 on a single node with 16 physical cores (32 logical cores) and at least 62 on 4 nodes, far better than GATK 3.8 and GATK 4.0.
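A minimal sketch of the adaptive-data-segmentation idea behind ADS-HC, not the thesis's actual implementation: genomic regions are weighted by a cost estimate (here a hypothetical read-count model, since the abstract does not give the real cost model) and greedily packed into partitions of roughly equal total cost, rather than into equal-sized byte ranges.

```scala
// Sketch of cost-balanced partitioning for ADS-HC-style skew mitigation.
// `Region` and `estimatedCost` are illustrative assumptions.
case class Region(contig: String, start: Long, end: Long, reads: Long)

object AdaptiveSegmentation {
  // Hypothetical cost model: processing time grows with read count.
  def estimatedCost(r: Region): Long = r.reads

  // Greedily assign each region (heaviest first) to the currently
  // lightest partition, so partitions carry roughly equal total cost.
  def assign(regions: Seq[Region], numPartitions: Int): Map[Int, Seq[Region]] = {
    val loads = Array.fill(numPartitions)(0L)
    val bins  = Array.fill(numPartitions)(Vector.empty[Region])
    for (r <- regions.sortBy(r => -estimatedCost(r))) {
      val i = loads.indexOf(loads.min) // lightest bin so far
      loads(i) += estimatedCost(r)
      bins(i) = bins(i) :+ r
    }
    bins.zipWithIndex.map { case (b, i) => i -> b }.toMap
  }
}
```

In Spark, such an assignment could back a custom Partitioner, so that each executor receives comparable work regardless of coverage skew, which is the intuition behind the nearly linear speedup reported above.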
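The overlapping-boundary requirement of point (2) can be illustrated independently of Hadoop-BAM; the byte-level splitting and the `margin` parameter below are assumptions, since the thesis actually customizes Hadoop-BAM to overlap blocks at the BAM-record level.

```scala
// Sketch: split [0, fileLength) into blocks whose end extends `margin`
// bytes past the nominal boundary, so a record straddling a cut point
// is visible to both neighboring blocks.
object OverlappedSplits {
  case class Split(start: Long, end: Long)

  def compute(fileLength: Long, blockSize: Long, margin: Long): Seq[Split] =
    (0L until fileLength by blockSize).map { s =>
      Split(s, math.min(s + blockSize + margin, fileLength))
    }
}
```

A read whose alignment crosses a nominal block boundary is then processed with full context on at least one side, which is what improves the result consistency reported in the abstract.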
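Point (3)'s access pattern, random point lookups for cache-resident known variants plus a sequential scan for the rest, can be sketched against a hypothetical key-value interface; none of the names below belong to the real Kudu client API.

```scala
// Hypothetical storage interface; NOT the real Kudu Java client API.
trait VariantStore {
  def pointLookup(contig: String, pos: Long): Option[String]            // random read
  def rangeScan(contig: String, from: Long, to: Long): Iterator[String] // sequential read
}

object KnownVariantReader {
  // Hot positions (retained by the cache optimization) use point lookups;
  // the remaining window is covered by one sequential scan. A real
  // implementation would also skip the hot positions during the scan.
  def read(store: VariantStore, contig: String, from: Long, to: Long,
           hotPositions: Set[Long]): Seq[String] = {
    val random = hotPositions.toSeq.sorted
      .filter(p => p >= from && p < to)
      .flatMap(p => store.pointLookup(contig, p))
    val sequential = store.rangeScan(contig, from, to).toSeq
    random ++ sequential
  }
}
```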
Keywords/Search Tags: Variant Detection, Spark, Distributed Storage, Computation Skew