
Design And Implementation Of Variant Detection Algorithm Based On Cloud Computing

Posted on: 2019-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Wu
Full Text: PDF
GTID: 2428330566486576
Subject: Computer Science and Technology
Abstract/Summary:
Variant detection is the basis of data analysis for high-throughput sequencing and is widely used in disease research, clinical treatment, and new drug development. However, with the advance of high-throughput sequencing technology and the falling cost of sequencing, traditional single-machine variant detection algorithms can no longer process the current massive volume of genome data in a timely manner. Cloud computing offers an easy-to-use, efficient parallel processing framework and a scalable storage system. It is an excellent distributed solution, but it also places demands on an algorithm's scalability and storage design: only an algorithm that combines good scalability with an appropriate distributed storage layout can fully exploit the computing power of multiple machines.

Building on the current mainstream variant detection algorithm, HaplotypeCaller, this thesis designs and implements the variant detection algorithm Cloud HC using the Spark in-memory computing framework and a distributed storage system. Cloud HC keeps its detection results highly consistent with those of HaplotypeCaller while achieving higher scalability. The main work of this thesis includes:

(1) To address the computation skew of HaplotypeCaller, a parallel strategy of adaptive data segmentation is proposed, and a variant detection algorithm based on adaptive data segmentation (ADS-HC) is designed and implemented (see the first sketch below). Experiments show that this strategy achieves a nearly linear speedup and a shorter elapsed time than Spark's common strategy of partitioning data into equal-sized blocks, in both single-node and multi-node settings.

(2) To meet the requirement that adjacent data blocks have overlapping boundaries, the Hadoop-BAM library is customized to partition BAM files into overlapped blocks (see the second sketch below). Experiments show that overlapped data blocks improve the consistency of the results.

(3) A Kudu storage solution is designed around the observation that, after cache optimization, most known variants are read randomly while the rest are read sequentially (see the third sketch below). Experiments show that, in both single-node and multi-node settings, the optimized Kudu read solution outperforms the HBase and HDFS solutions.

Overall, the result consistency between Cloud HC and GATK 3.8 exceeds 99.9%. In terms of scalability, Cloud HC achieves a speedup of at least 17 on a single node with 16 physical cores (32 logical cores) and at least 62 on 4 nodes, far better than GATK 3.8 and GATK 4.0.
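A minimal sketch of the adaptive-data-segmentation idea behind ADS-HC, not the thesis's actual implementation: genomic regions are weighted by a cost estimate (here a hypothetical read-count model, since the abstract does not give the real cost model) and greedily packed into partitions of roughly equal total cost, rather than into equal-sized byte ranges.

```scala
// Sketch of cost-balanced partitioning for ADS-HC-style skew mitigation.
// `Region` and `estimatedCost` are illustrative assumptions.
case class Region(contig: String, start: Long, end: Long, reads: Long)

object AdaptiveSegmentation {
  // Hypothetical cost model: processing time grows with read count.
  def estimatedCost(r: Region): Long = r.reads

  // Greedily assign each region (heaviest first) to the currently
  // lightest partition, so partitions carry roughly equal total cost.
  def assign(regions: Seq[Region], numPartitions: Int): Map[Int, Seq[Region]] = {
    val loads = Array.fill(numPartitions)(0L)
    val bins  = Array.fill(numPartitions)(Vector.empty[Region])
    for (r <- regions.sortBy(r => -estimatedCost(r))) {
      val i = loads.indexOf(loads.min) // lightest bin so far
      loads(i) += estimatedCost(r)
      bins(i) = bins(i) :+ r
    }
    bins.zipWithIndex.map { case (b, i) => i -> b }.toMap
  }
}
```

In Spark, such an assignment could back a custom Partitioner, so that each executor receives comparable work regardless of coverage skew, which is the intuition behind the nearly linear speedup reported above.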
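The overlapping-boundary requirement of point (2) can be illustrated independently of Hadoop-BAM; the byte-level splitting and the `margin` parameter below are assumptions, since the thesis actually customizes Hadoop-BAM to overlap blocks at the BAM-record level.

```scala
// Sketch: split [0, fileLength) into blocks whose end extends `margin`
// bytes past the nominal boundary, so a record straddling a cut point
// is visible to both neighboring blocks.
object OverlappedSplits {
  case class Split(start: Long, end: Long)

  def compute(fileLength: Long, blockSize: Long, margin: Long): Seq[Split] =
    (0L until fileLength by blockSize).map { s =>
      Split(s, math.min(s + blockSize + margin, fileLength))
    }
}
```

A read whose alignment crosses a nominal block boundary is then processed with full context on at least one side, which is what improves the result consistency reported in the abstract.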
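Point (3)'s access pattern, random point lookups for cache-resident known variants plus a sequential scan for the rest, can be sketched against a hypothetical key-value interface; none of the names below belong to the real Kudu client API.

```scala
// Hypothetical storage interface; NOT the real Kudu Java client API.
trait VariantStore {
  def pointLookup(contig: String, pos: Long): Option[String]            // random read
  def rangeScan(contig: String, from: Long, to: Long): Iterator[String] // sequential read
}

object KnownVariantReader {
  // Hot positions (retained by the cache optimization) use point lookups;
  // the remaining window is covered by one sequential scan. A real
  // implementation would also skip the hot positions during the scan.
  def read(store: VariantStore, contig: String, from: Long, to: Long,
           hotPositions: Set[Long]): Seq[String] = {
    val random = hotPositions.toSeq.sorted
      .filter(p => p >= from && p < to)
      .flatMap(p => store.pointLookup(contig, p))
    val sequential = store.rangeScan(contig, from, to).toSeq
    random ++ sequential
  }
}
```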
Keywords/Search Tags: Variant Detection, Spark, Distributed Storage, Computation Skew