
Research On High Performance Biological Data Compression Algorithm Based On Heterogeneous Computing Platform

Posted on: 2017-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Dou
GTID: 2180330485982223
Subject: Computer Science and Technology
Abstract/Summary:
DNA sequencing technologies and platforms continue to develop, and their low cost and high speed bring ever greater benefit to biological and medical research. High-speed sequencing, however, produces DNA data at a very large scale, and storing more DNA sequence data in limited storage space has become a problem for computer scientists. My preferred solution is to compress the DNA sequence data.

In the field of data compression, general-purpose data compression has a longer history than DNA sequence data compression. At first, general-purpose compression software was used to compress DNA sequence data, but it did not make good use of the features of that data. There is therefore considerable room for improvement in DNA sequence data compression.

This thesis selects parallelizable compression algorithms and improves compression performance on a heterogeneous computing platform, based on a thorough study of the features of DNA sequence data and of the existing classic special-purpose DNA sequence compression algorithms.

To that end, I first analyzed the features of DNA sequence data and the FASTQ storage format. I found that making full use of the biological information in DNA sequence data and exploiting the regularities of its storage format greatly improves the compression ratio.

Next, covering statistical, dictionary-based, and spatial-redundancy compression, I introduced the classical data compression algorithms: static and adaptive Huffman coding, arithmetic coding, the LZ family of compression algorithms, and run-length encoding.
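As a minimal sketch of one of the classical techniques named above, run-length encoding can be written in a few lines of Python. The function names and the sample quality string are illustrative assumptions; they are not taken from the thesis or from any DNA compression tool.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


# Hypothetical quality line, as might appear in a FASTQ record.
quality = "IIIIIIIIHHHH##"
encoded = rle_encode(quality)
assert rle_decode(encoded) == quality  # the encoding is lossless
```

Run-length encoding only pays off when long runs are common, which is why it is usually combined with statistical or dictionary coding instead of being used alone.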
These basic algorithms laid a solid foundation for the development of both general-purpose data compression and DNA sequence data compression.

To highlight the advantages of existing DNA-specific compression software over general-purpose compression software when processing DNA sequence data, I introduced both kinds of software. I then compared their test performance along three dimensions (compression ratio, compression time, and decompression time) and analyzed the test results in detail.

Based on the test results, I chose DSRC, which is parallelizable and outperformed the other DNA-specific compressors in the tests, as the basis of DSRC_HYBRID, a parallel version implemented on a heterogeneous computing platform. I therefore thoroughly analyzed and summarized the serial and parallel algorithms, workflow, and source code of DSRC. In designing the workflow and working mode of DSRC_HYBRID, I distributed the data load between the CPU and the MIC through the Message Passing Interface (MPI) and achieved load balance, guided by tests of CPU and MIC data processing capability and by VTune hotspot function analysis. Finally, I implemented a peer-to-peer working mode for the MIC and CPU, using a lock-based approach to synchronize data across threads.

Using the MIC co-processor, I achieved an average speedup of 17.956 times in compression time and 15.203 times in decompression time on the heterogeneous computing platform, without changing the compression ratio of the DSRC algorithm. I therefore conclude that accelerating existing DNA-specific compression algorithms on a heterogeneous computing platform is a viable approach to compressing massive DNA sequence data.
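The idea of balancing load according to measured device capability can be sketched as a proportional split of data blocks. This is only an illustration under stated assumptions: the function name, the device labels, and the throughput figures are hypothetical, and the abstract does not say how DSRC_HYBRID actually computes its split. In an MPI program each rank would then process its own share; that part is omitted here.

```python
def split_blocks(total_blocks, throughputs):
    """Assign block counts in proportion to each device's measured throughput.

    throughputs maps a device label to a positive rate (any consistent unit).
    """
    total_rate = sum(throughputs.values())
    # Start from the rounded-down proportional share for every device.
    shares = {
        dev: int(total_blocks * rate / total_rate)
        for dev, rate in throughputs.items()
    }
    # Hand the blocks lost to rounding to the fastest devices first.
    leftover = total_blocks - sum(shares.values())
    fastest_first = sorted(throughputs, key=throughputs.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares


# Hypothetical rates: the co-processor is assumed three times faster here.
shares = split_blocks(10, {"cpu": 1, "mic": 3})
assert sum(shares.values()) == 10  # every block is assigned exactly once
```

A static split like this assumes throughput stays constant across the input; a dynamic work queue is a common alternative when block costs vary.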
Keywords/Search Tags: DNA genetic sequence data, classical data compression algorithm, DSRC algorithm, MIC co-processor, MPI