Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash

Posted on:2021-02-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:K Xu

Full Text:PDF

GTID:1360330632957839

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

In recent years,the amount of data generated in the life sciences has dramati-cally increased propelled by the rapid development of high throughput sequencing technologies(commonly known as next-generation sequencing,NGS).Over the past few decades,the performance of processors has been increasing rapidly in accordance with Dennard Scaling[7]and Moore's Law[28,29].However,a bottleneck has been reached,and single-core processor performance improvement has been reduced from 22%per year in the 1980s to 3%now,which means that the speed of single-threaded code is stagnating.Therefore,on the one hand,high-throughput sequencing data has been in-creasing rapidly.On the other hand,the improvement of computing performance in recent years mainly focuses on the development of heterogeneous architecture.Therefore,how to deal with high-throughput sequencing data in the new archi-tecture is an urgent problem to be solved.In the process of high-throughput sequencing data,error correction and read alignment are the two steps of pre-processing.There are many researches on error correction and read alignment,but the research on heterogeneous architecture processor and large-scale data set is relatively less.How to improve the basic algorithm to reduce the amount of calculation.how to adapt to the characteristics of different architecture proces-sors and how to implement distributed implementation for large-scale data sets are problems to be solved.This paper mainly focuses on the algorithms and applications of read error correction and read mapping in the process of DNA se-quencing data on the new architectures,such as Intel multi-core CPU,many-core architecture KNC,KNL and Shenwei architecture SW26010.The main research results of this paper are as follows:1)Error correction is an important but time-consuming initial step when pro-cessing this data in order to improve the quality of downstream analyses.In this paper,we present SPECTR-a Scalable Parallel Error CorrecToR designed to improve the throughput of DNA error correction for Illumina reads on various parallel platforms.Our design is based on a k-spectrum approach where a Bloom filter is frequently probed as a key operation and is optimized towards AVX-512-based multi-core CPUs,Xeon Phi many-cores(both KNC and KNL),and heterogeneous compute clusters.A number of architecture-specific optimizations are employed to achieve high perfor-mance such as memory alignment,vectorized Bloom filter probing,and a stack-based iteration to eliminate recursion.Our experiments show that our optimizations result in speedups of up to 2.8,5.2,and 9.3 on a CPU(Xeon W-2123),a KNC-based Xeon Phi(31S1P),and a KNL-based Xeon Phi(7210),respectively,compared to a multi-threaded CPU reference im-plementation for the error correction stage.Furthermore,when executed on the same hardware.SPECTR achieves a speedup of up to 1.7,2.1,2.4,and 6.4,compared to the state-of-the-art tools Lighter,BLESS2,RECKONER,and Musket,respectively.In addition,our MPI implementation exhibits an efficiency of around 86%when executed on 32 nodes of the Tianhe-2 supercomputer.2)Read mapping is a performance-critical task,being one of the first stages re-quired for many different types of NGS analysis pipelines.In this paper,we introduce S-Aligner,a highly scalable read mapper designed for the Sunway TaihuLight supercomputer and its fourth-generation Shenwei many-core ar-chitecture(SW26010).S-Aligner employs a combination of optimization techniques to overcome both the memory-bound and the compute-bound bottlenecks in the read mapping algorithm.In order to make full use of the compute power of Sunway TaihuLight,our design employs three levels of parallelism:(1)internode parallelism using MPI based on a task-grid pat-tern,(2)intranode parallelism using multithreading and asynchronous data transfer to fully utilize all 260 cores of the SW26010 many-core processor,and(3)vectorization to exploit the available 256-bit SIMD vector registers.Moreover,we have employed asynchronous access patterns and data-sharing strategies during file I/O to overcome bandwidth limitations of the network file system.Our performance evaluation demonstrates that S-Aligner scales almost linearly with approximately 95%efficiency for up to 13,312 nodes(concurrently harnessing more than 3 million compute cores).Furthermore,our implementation on a single node outperforms the established RazerS3 mapper running on a platform with eight Intel Xeon E7-8860v3 CPUs while achieving highly competitive alignment accuracy.3)After the analysis of S-Aligner,we present SWMapper,a scalable and ef-ficient read mapper for the Sunway TaihuLight supercomputer.A number of optimization techniques are proposed to achieve high performance on its heterogeneous architecture which are centered around a memory-efficient succinct hash index data structure including seed filtration,duplicate re-moval,dynamic scheduling,asynchronous data transfer,and overlapping I/O and computation.Furthermore,a vectorized version of the banded Myers algorithm for pairwise alignment with 256-bit vector registers is pre-sented to fully exploit the computational power of the SW26010 processor.Our performance evaluation shows that SWMapper using all 4 compute groups of a single Sunway TaihuLight node outperforms S-Aligner on the same hardware by a factor of 6.2.In addition,compared the state-of-the-art CPU-based mappers RazerS3,BitMapper2,and Hobbes3 running on a 4-core Xeon W-2123v3 CPU,SWMapper achieves speedups of 26.5,7.8,and 2.6,respectively.Our optimizations achieve an aggregated speedup of 11 compared to the naive implementation on one compute group of an SW26010 processor as well as a strong scaling efficiency of 74%on 128 compute groups.

Keywords/Search Tags:

read error correction, read alignment, heterogeneous architecture, supercomputer

PDF Full Text Request

Related items

1	The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets
2	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
3	Nucleotide Sequence Alignment Based On Sparse Indexing Algorithm
4	A Long Read Hybrid Error Correction Algorithm Based On Segmented PHMM
5	A Study And Implementation Of High Throughput Algorithm For Long Read Error Correction
6	Parallel DNA Read Mapping Algorithms On Multi-core And Many-core Architecture
7	Indel Of Structural Variation Detection Algorithm Based On Split-read Method
8	A Long Read Based De Novo Method To Find Repetitive Element In The Genome
9	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
10	The Read-through Expression Of Se-containing Single Chain Antibody With A Catalytic Triad And Research On Its Biological Effects