
Research On Compression Algorithm And Parallelism Of SAM Gene Sequence

Posted on: 2024-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Fa
Full Text: PDF
GTID: 2530307136489154
Subject: Software engineering
Abstract/Summary:
Precision medicine, the development of novel drugs, and the discovery of genetic laws all depend on genetic data. With the emergence of large-scale gene sequencing projects, the volume of SAM gene sequence data is growing explosively, and data compression is essential for reducing the cost of storing it. Reference-free compression methods are highly flexible, but existing algorithms do not fully exploit the association information contained in SAM files during encoding, and when compressing QUAL fields they consider only the redundancy present in the original record order. In addition, parallel processing in existing algorithms is limited to multi-threading within a single machine, which cannot meet the demand for high performance. This thesis studies data compression and parallel algorithms for SAM gene sequences, which can provide technical support for applications related to gene sequencing. Its main contributions are as follows:

(1) A compression algorithm, ZSAM, based on hypothetical reference sequences and two-level reordering, is proposed to address the low compression ratio of current reference-free algorithms. The algorithm sorts QUAL fields by frequency score and graph similarity score to increase local redundancy. For SEQ sequences whose RNAME is available, the CIGAR field is used to perform difference encoding. For SEQ sequences whose RNAME is not available, a Bloom filter is used to construct high-quality indexes, and match encoding is then performed. For the remaining fields, suitable encoding and compression tools are selected according to each field's type and variation characteristics. The experimental results show that ZSAM achieves a better compression ratio than existing reference-free algorithms.

(2) A Spark-based parallel compression algorithm for SAM gene sequences, ZSAM-Spark, is proposed to address the problem that the speed of parallel compression algorithms is limited by a single machine. The algorithm partitions SAM files at a fine granularity while avoiding data skew between tasks. An efficient and fault-tolerant parallel compression process is designed using Spark's RDD operators. Multiple Jobs within a Spark Application are run in parallel using multi-threading, and a high-performance serialization mechanism is used to reduce the cost of data transmission during Shuffle. The experimental results show that ZSAM-Spark is suitable for distributed scenarios and can effectively improve compression speed.

(3) A SAM-oriented gene compression and parallelization system is designed and implemented to address the limited use of current compression algorithms in practical scenarios. The system achieves load balancing and high availability of its core services, and the design of its functional modules takes the diverse requirements of users into account. Based on a multi-master, multi-slave architecture, it supports asynchronous task submission and real-time task monitoring to ensure the flexibility and stability of the system.
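
As a rough illustration of the Bloom-filter indexing idea mentioned in contribution (1), the following is a minimal sketch and not the thesis's implementation. The abstract does not say how ZSAM builds or queries its index, so the k-mer length, the filter size, the number of hash functions, and the names `BloomFilter` and `kmers` are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter. Size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq (hypothetical helper)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


# Index the k-mers of one SEQ string, then test a candidate k-mer.
index = BloomFilter()
for kmer in kmers("ACGTACGTTAGC", 4):
    index.add(kmer)

possibly_seen = "ACGT" in index  # inserted k-mers are never reported absent
```

A Bloom filter can report false positives but never false negatives, so in a scheme like this a positive lookup would still need to be confirmed against the actual sequence before a match is encoded.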
Keywords/Search Tags:SAM, Gene Sequence Compression, Parallel Optimization, Distributed Storage, Distributed Computing