Research On Compression Of DNA Data Based On K-mer Short Sequences

Posted on: 2015-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: W P Xiong
Full Text: PDF
GTID: 2298330422481968
Subject: Signal and Information Processing
Abstract/Summary:
Because the volume of DNA sequence data is huge, DNA data compression is one of the indispensable key technologies in bioinformatics. It is the basis for efficient storage, reading, and transmission, and a precondition for DNA sequencing, sequence alignment, gene prediction, and related tasks. Research on DNA sequence data compression therefore has great theoretical significance and practical value. In recent years, with the development of information processing technology and in-depth study of the characteristics of DNA sequence data, a variety of DNA-specific compression algorithms have emerged.

Based on the high repeatability of DNA sequence data, this thesis statistically analyzes the repeatability of short k-mer sequence fragments and summarizes the distribution pattern of k-mer short sequences in DNA sequence data.

Because the k-mer distribution differs greatly between different regions of a DNA sequence, this thesis proposes a DNA data compression algorithm based on segmented encoding. In the preprocessing stage, the DNA sequence is divided into short fragments of 64 bases, and each fragment is coded independently. First, the occurrences of the most frequently repeated 3-mer in each fragment are counted; these 3-mers are then coded according to their counts and locations, thereby compressing the DNA sequence. The segmented encoding algorithm is simple and performs well on commonly used benchmark DNA sequences.

Because some k-mers have a high repetition rate in DNA sequences when k is small, this thesis also proposes a DNA data compression algorithm based on hybrid GA-PSO (genetic algorithm and particle swarm optimization) optimization. Different k-mers of the same length are combined to form different particles, and the hybrid GA-PSO algorithm is then used to determine the optimal k-mer combination over the whole DNA sequence, namely the combination that has a high repetition rate and achieves the maximum compression ratio. These optimal k-mers are then encoded to compress the DNA data. In the hybrid GA-PSO algorithm, a support vector machine divides the DNA particles into two parts before each round of optimization, and the two parts are optimized with the GA and the PSO algorithm respectively. The experimental results show that the algorithm achieves the desired compression rate and is more robust than traditional algorithms.
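The preprocessing step of the segmented encoding algorithm can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not say whether 3-mers are counted with overlapping windows, how ties are resolved, or how counts and locations are turned into a code. The sketch assumes overlapping windows with a step of one base, gives ties to the 3-mer seen first, and stops at the statistics step. The names `split_fragments` and `top_kmer` are hypothetical.

```python
from collections import Counter

FRAGMENT_LEN = 64  # fragment length stated in the abstract
K = 3              # k-mer length stated in the abstract


def split_fragments(seq, size=FRAGMENT_LEN):
    """Cut seq into consecutive fragments of `size` bases (the last may be shorter)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def top_kmer(fragment, k=K):
    """Return (kmer, count, positions) for the most frequent k-mer in a fragment.

    Assumption (not from the abstract): windows overlap with step 1, and ties
    go to the k-mer encountered first. Returns None if the fragment is
    shorter than k.
    """
    if len(fragment) < k:
        return None
    kmers = [fragment[i:i + k] for i in range(len(fragment) - k + 1)]
    kmer, count = Counter(kmers).most_common(1)[0]
    positions = [i for i, km in enumerate(kmers) if km == kmer]
    return kmer, count, positions


def fragment_statistics(seq):
    """Per-fragment statistics that an encoder could consume afterwards."""
    return [top_kmer(fragment) for fragment in split_fragments(seq)]
```

Calling `fragment_statistics` on a sequence string returns one entry per 64-base fragment. An actual encoder would still need a rule for writing the counts, the locations, and the remaining bases, which the abstract leaves unspecified.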
Keywords/Search Tags: DNA compression, k-mer repeatability, segmented encoding, bases particle swarm, hybrid optimization