Design And Implementation Of GATK Genetic Analysis Software Parallel Acceleration Scheme

Posted on: 2020-10-12  Degree: Master  Type: Thesis
Country: China  Candidate: R G Huang  Full Text: PDF
GTID: 2404330590983056  Subject: Electronics and Communications Engineering
Abstract/Summary:
Big data analysis of genomic sequencing is the basis of clinical treatment in precision medicine, a medical model that uses genetic data analysis to accurately identify the cause of a disease and its treatment. GATK (Genome Analysis Toolkit) is one of the most commonly used software packages for genome sequencing big data analysis and an indispensable toolkit for almost all types of genetic data analysis. However, GATK is extremely slow, which greatly limits its role in clinical medical practice. To address this problem, this thesis focuses on accelerating GATK and proposes an FPGA-accelerated distributed system for the GATK Best Practices pipeline, built on Spark and hardware acceleration. The work includes the following. First, to address the fact that GATK can only run in an inefficient single-machine mode, the thesis develops a scalable distributed parallel acceleration scheme for GATK. Unlike similar distributed acceleration schemes, this scheme studies the problem of data skew in distributed applications in depth. Second, based on a study of the MuTect2 workflow, the thesis proposes an acceleration scheme that combines hardware acceleration with parallelization. Compared with other MuTect2 acceleration schemes, this scheme adapts to different types of gene sequencing data and achieves better acceleration, reducing the runtime of both MuTect2 and GATK. The proposed scheme has been successfully commercialized. The experimental results show that, compared with the original GATK, the proposed acceleration scheme delivers excellent acceleration while preserving the correctness of the results, and that the running time can be reduced further by adding nodes to the distributed cluster.
Keywords/Search Tags:Gene, Precision medicine, GATK, Spark, Distribution, parallelization, MuTect2, Hardware acceleration
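
The abstract names data skew as a central concern of the distributed scheme but does not describe how it is handled. As a generic illustration only, and not the method used in the thesis, the sketch below shows one common way to reduce skew: assigning genomic regions to partitions by estimated workload rather than one region per partition. The function names, the region labels, and the read counts are all hypothetical.

```python
import heapq


def balance_regions(region_weights, num_partitions):
    """Greedy longest-processing-time assignment (illustrative, not from the thesis).

    Regions are visited from heaviest to lightest, and each one is placed on
    the partition that currently carries the smallest total weight.
    """
    # Min-heap of (current_load, partition_id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_partitions)}

    by_weight = sorted(region_weights.items(), key=lambda kv: kv[1], reverse=True)
    for region, weight in by_weight:
        load, p = heapq.heappop(heap)
        assignment[p].append(region)
        heapq.heappush(heap, (load + weight, p))
    return assignment


def partition_loads(region_weights, assignment):
    """Total weight carried by each partition."""
    return {
        p: sum(region_weights[r] for r in regions)
        for p, regions in assignment.items()
    }


# Hypothetical read counts per region; real data would come from the input file.
reads_per_region = {
    "chr1": 900, "chr2": 850, "chr3": 400,
    "chr4": 350, "chr5": 300, "chrM": 50,
}
assignment = balance_regions(reads_per_region, 3)
loads = partition_loads(reads_per_region, assignment)
print(assignment)
print(loads)
```

In a Spark job, an assignment like this could feed a custom partitioner so that no single executor receives a disproportionate share of reads; whether the thesis takes this approach is not stated in the abstract.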