
Optimized Design And Implementation Of Gene Data Pre-processing Based On Cloud Computing

Posted on: 2019-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2428330566486575
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Next-Generation Sequencing (NGS) technologies, sequencing throughput has grown faster than Moore's Law and the cost of sequencing has fallen. Gene data is now applied in many fields, such as health care. However, it is still difficult to meet the demand for timely results, and large-scale genetic data analysis relies on high-performance hardware and software tools. Cloud computing offers ultra-large scale, virtualization, high reliability, versatility, and high scalability, so it can address the problems of genetic data processing at a lower cost.

Based on the Spark cloud computing platform, this thesis optimizes the current genetic data preprocessing pipeline so that genetic data is processed in parallel across multiple cores and multiple nodes, improving the timeliness of preprocessing. The thesis first investigates the characteristics of current genetic data preprocessing and uses them to reduce the disk I/O cost of the original process. It then analyzes the program structure of the two main preprocessing tools, the sequence alignment tool and the duplicate-marking tool, and optimizes and implements both in the Spark environment.

Because the original sequence alignment tool, BWA, offers excellent scalability, good performance, and high computational intensity, the thesis designs PipeBWA, a framework that runs BWA in the Spark environment. By optimizing the storage of sequencing result data and applying a better method of calling external programs, the framework is lightweight, extensible, and compatible with the features of any version of BWA. Sequence alignment experiments on real genetic datasets show that PipeBWA takes only one-third of the time taken by GATK4, the state-of-the-art cluster tool for genetic data processing.

MarkDuplicates, a tool in Picard, is mainly used to mark duplicate data in genetic data preprocessing. However, MarkDuplicates cannot split its input data for parallel processing, and its core program can only be executed serially. By extracting its data-parallel module, the thesis implements the duplicate detection tool DeDuplicatesSpark on Spark. The tool uses multiple stages of aggregation to find candidate regions, which reduces the size of the key-value pairs, and it applies storage optimization of the alignment result data, key-value compression, bitmap indexing, and Spark SQL columnar aggregation. Experiments on real genetic datasets show that DeDuplicatesSpark is tens of times faster than both the MarkDuplicates tool in Picard and the MarkDuplicatesSpark tool in GATK4.

To solve the problem of incomplete data during computation caused by distributed storage, and to make better use of the performance gains that distributed storage offers, the thesis redesigns the distributed storage formats of the sequencing result data and the alignment result data, so that the optimized designs of PipeBWA and DeDuplicatesSpark can effectively improve the performance of genetic data preprocessing. Experiments show that the optimized preprocessing procedure effectively reduces the running time of the original procedure and provides a good foundation for improving the timeliness of genetic data analysis.
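
The multi-stage aggregation idea behind duplicate marking can be illustrated with a small sketch in plain Python. This is not the DeDuplicatesSpark implementation, which the abstract does not describe in detail. It assumes a simplified duplicate key (chromosome, position, strand) and a simplified rule that the read with the highest base-quality sum is kept. The names Read, dup_key, local_best, and mark_duplicates are hypothetical, and the lists of reads stand in for Spark partitions.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified alignment record (not a real SAM/BAM schema)."""
    name: str       # read name, assumed unique
    chrom: str      # reference sequence the read aligned to
    pos: int        # alignment start position
    reverse: bool   # True if aligned to the reverse strand
    qual_sum: int   # sum of base qualities, used to pick the read to keep


def dup_key(read):
    """Assumed duplicate key: reads sharing this key are duplicate candidates."""
    return (read.chrom, read.pos, read.reverse)


def local_best(partition):
    """Stage 1: inside one partition, keep only the best read per key.

    This shrinks the data that has to be exchanged between nodes.
    """
    best = {}
    for read in partition:
        key = dup_key(read)
        if key not in best or read.qual_sum > best[key].qual_sum:
            best[key] = read
    return best


def mark_duplicates(partitions):
    """Stage 2: merge per-partition winners, then flag every other read.

    Returns a dict mapping read name to True if the read is a duplicate.
    """
    global_best = {}
    for part in partitions:
        for key, read in local_best(part).items():
            if key not in global_best or read.qual_sum > global_best[key].qual_sum:
                global_best[key] = read
    return {
        read.name: global_best[dup_key(read)].name != read.name
        for part in partitions
        for read in part
    }
```

On a cluster, stage 1 would correspond to a per-partition combine step and stage 2 to a shuffle-and-reduce step. A real tool would also have to handle read pairs and clipped alignments, which this sketch ignores.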
Keywords/Search Tags: NGS, sequence alignment, columnar storage, Spark, GATK