
Research On Optimizing Processing Algorithms For Next-generation Sequencing Data

Posted on: 2024-06-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhang
GTID: 1520306923477124
Subject: Software engineering
Abstract/Summary:
With the development of life science and technology, and especially the rapid development of next-generation sequencing (NGS) technology, the amount of sequencing data has grown exponentially, outpacing Moore's Law. At the same time, multi-core processors have become mainstream in modern workstations and servers. Many NGS data analysis pipelines have been developed, such as metagenomic pathogen detection and variant calling pipelines. How to fully utilize the parallel processing performance of modern multi-core processors to handle such a large amount of data is an urgent problem. This thesis focuses on the performance and algorithmic optimization of two classical sequencing data processing pipelines, pathogen detection and variant calling, using the computational performance of modern multi-core systems to address their efficiency problems. The main contributions of this thesis are as follows:

1. The first problem to be solved is how to read and parse sequencing data quickly. To address this issue, we propose RabbitFX, which efficiently parses FASTA/Q format files on a multi-core platform. RabbitFX uses a data chunk-based formatting approach. First, RabbitFX reads sequencing data from disk into data chunks. It then parses these chunks with multiple threads. A concurrent data pool manages the data chunks, which minimizes the memory footprint and avoids frequent allocations. For FASTQ format data, RabbitFX supports single-end (SE) and paired-end (PE) sequencing data. For FASTA format data, RabbitFX adopts a linked-list strategy to ensure that a complete sequence can be accessed in one list. When parsing gzip-compressed files, RabbitFX uses the highly optimized IGZIP library to decompress data in a streaming fashion, which allows it to parse compressed FASTA/Q data at a speed comparable to that of uncompressed data. As a case study, we integrate RabbitFX into three commonly used bioinformatics tools: fastp, Ktrim, and mash. The integration of RabbitFX achieves a speedup of at least 11.6x (6.6x), 2.4x (2.4x), and 3.7x (3.2x) for uncompressed (gzip-compressed) files compared to the original versions, demonstrating the utility and effectiveness of RabbitFX.

2. For the metagenomic pathogen detection pipeline, we propose the RabbitV toolset, which includes the unique k-mer generation tool RabbitUniq and the unique k-mer-based pathogen detection tool RabbitV. RabbitUniq solves the problem of memory explosion when handling large datasets. k-mers are divided into different bin files (on disk) according to their signatures. This binning strategy ensures that different bins do not contain identical k-mers, which enables bins to be processed in parallel. The bin files are then processed concurrently, and unique k-mers are identified by marking k-mers that are duplicated across different genomes. RabbitUniq can generate the unique k-mers of bacterial references (355 GB in total) in 40 minutes, while the functionally comparable tool UniqueKMER cannot finish this task because of memory limitations. RabbitV solves the efficiency problem when processing extensive metagenomic sequencing datasets. RabbitV uses RabbitFX as its data parsing module and SSE/AVX-based vector instructions for computationally intensive core functions. When loading large unique k-mer datasets, RabbitV adopts a multi-threaded loading strategy that significantly improves loading efficiency. In addition, RabbitV encodes k-mers at the data generation step instead of the detection step, which reduces storage space and speeds up loading. RabbitV is able to detect SARS-CoV-2 in about 5 minutes for 40 samples (255 GB in total).

3. For the whole genome sequencing (WGS) variant calling pipeline, we propose DeepFilter, a variant filter based on VarDict calling results. DeepFilter is trained with a neural network on synthetic datasets and evaluated on a real-world dataset and a separate synthetic dataset. The evaluation results show that DeepFilter outperforms VarDict's built-in filter and other third-party tools in SNV and indel variant filtering tasks.

4. Further, to improve the running efficiency and accuracy of the variant calling pipeline, we propose RabbitVar, an ultra-fast and accurate somatic small-variant caller for multi-core architectures. RabbitVar takes full advantage of the computational performance of multi-core computing platforms. It is highly optimized, featuring multi-threading, a high-performance memory allocator, efficient data structures, and vectorization on modern multi-core CPUs. The combination of these optimizations makes it both highly efficient and scalable. For high-depth sequencing datasets, the runtime of RabbitVar scales linearly with sequencing depth. RabbitVar then uses an XGBoost-based filtering model to further filter the candidate variants. To evaluate the accuracy and generalization of RabbitVar for SNV and indel calling, an extensive validation was performed using real tumor-normal datasets covering different sequencing conditions, sample purities, and sequencing depths. The evaluation results demonstrate that RabbitVar achieves highly competitive F1-scores when calling SNVs. Moreover, when calling the more challenging indel variants, it consistently achieves the highest F1-scores.
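
The signature-based binning described for RabbitUniq in contribution 2 can be illustrated with a minimal sketch. It is not RabbitUniq's implementation. The abstract does not say how signatures are computed, so the sketch assumes the signature is the lexicographically smallest m-mer of a k-mer (a minimizer). It also keeps bins in memory rather than in bin files on disk, and processes them one after another. All function names and parameters here are hypothetical.

```python
import zlib

def signature(kmer, m=4):
    # Assumed signature: the lexicographically smallest m-mer in the k-mer.
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bin_index(kmer, n_bins, m=4):
    # A stable hash of the signature picks the bin, so identical k-mers
    # always land in the same bin and never appear in two different bins.
    return zlib.crc32(signature(kmer, m).encode()) % n_bins

def bin_kmers(genomes, k, n_bins, m=4):
    # In-memory stand-in for the on-disk bin files mentioned in the text.
    bins = [[] for _ in range(n_bins)]
    for genome_id, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            bins[bin_index(kmer, n_bins, m)].append((kmer, genome_id))
    return bins

def unique_in_bin(entries):
    # Mark a k-mer as duplicated (None) once it is seen in a second genome.
    owner = {}
    for kmer, genome_id in entries:
        if kmer not in owner:
            owner[kmer] = genome_id
        elif owner[kmer] != genome_id:
            owner[kmer] = None
    return {kmer: gid for kmer, gid in owner.items() if gid is not None}

def unique_kmers(genomes, k, n_bins=8, m=4):
    # Bins are independent, so each call to unique_in_bin could run in its
    # own worker; they are processed sequentially here for simplicity.
    result = {}
    for entries in bin_kmers(genomes, k, n_bins, m):
        result.update(unique_in_bin(entries))
    return result
```

Because no k-mer can appear in two bins, the per-bin results can be merged without any cross-bin deduplication, which is what makes parallel processing of the bins safe.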
Keywords/Search Tags:High-performance Computing, Next-generation Sequencing Data, Somatic Variant Calling, Pathogen Detection, Bioinformatics