Font Size: a A A

Dna Sequence Data Compression Algorithm

Posted on:2013-02-11Degree:MasterType:Thesis
Country:ChinaCandidate:M M HuFull Text:PDF
GTID:2218330371457248Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
Since the end of the 20th century, biological sequencing technology has a continuous development, and the various associated biological data has formed a huge bioinformatics database rapidly. It is an awkward problem for biologists and computer experts to analyze and manage these mass of data effectively. Data compression techniques can solve the problem well. DNA sequence data is one of the most important biological data. It is really different from other types of data which also need to be compressed, and almost all the universal data compression algorithms fail to compress it. Therefore, the study of data compression algorithm for DNA sequences is of great significance.The compression of DNA sequences is a very difficult task, and many scholars are working on it and propose a few classical algorithms. Some of them compress well and are common in use, such as the CTW+LZ algorithm, DNACompress and DNAPack. They are all based on approximate matching, searching and encoding approximate repeats, for DNA sequences always contain many repeats which are duplicated except for the substitution, insertion, or deletion of a few bases. However, the search for approximate repeats costs lots of time and memory while the enhancement of the compression ratio is not very obvious. As a result, this paper departs from the previous strategy by searching and encoding only exact repeats (including complementary palindromes), then describes another algorithm, called DNACE (DNA Compression based only on Exact matching). We are wondering how much compression we can achieve working only with exact matches.The main work:first, make a deep research on universal data compression algorithms and the existing algorithms for DNA sequences, including advantages, disadvantages and room for improvement. Second, make a further research on the characteristics of DNA sequences which is the theoretical basis of better compression. Third, design the DNACE algorithm. It combines the LZ77 algorithm with LZ78 to compress the exact matches which are choosed by a dynamic programming approach in DNA sequences and uses Arith-2 compression based on PPM model for the rest non-repeat regions. Finally, test the compression performance of DNACE. Tests show that, DNACE is easy to realize, runs fast and achieves a compression ratio very close to most DNA compresors. This compression algorithm is also useful for bioinformatics research.
Keywords/Search Tags:DNA sequence data, data compression, exact matching
PDF Full Text Request
Related items