Research Of DNA Data Compression Method

Posted on:2015-03-13

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Tan

Full Text:PDF

GTID:1268330422981631

Subject:Information and Communication Engineering

Abstract/Summary:

PDF Full Text Request

Due to the recent advancements in DNA sequencing technologies, the biomedicalresearch is faced with the problems of storage and transmission of the huge amount of DNAdata. The use of compression techniques provides an important candidate solution for thestorage and transmission challenges of DNA data, in other words, the DNA data is stored inthe smaller space by the effective coding methods. Because of the particularity of the DNAdata, it is not very ideal with traditional compression algorithms. Thus compressionalgorithms for DNA data specially appeared, including substitutional methods and statisticalmethods. The substitutional methods look for high frequently repeated fragments of DNA inoriginal DNA data, and index these fragments into a dictionary, which will be used to indteadthese fragments of DNA data; The statistical methods estimate the corresponding probabilitystatistical model through statistical rule of original DNA data, and then DNA data is entropycoded by this statistical model. These two types of compression methods have bettercompression effect for the benchmark DNA sequences, but limited for high-throughput DNAdata compression. So, a series of high-throughput DNA data compression methods appeared.At present, the high-throughput DNA data compression methods mainly includeresequencing DNA data compression method and de novo sequencing DNA data compressionmethod. The resequencing DNA data compression method mainly based on the referencegenome, which obtain the high compression ratio but require reference genome, but in fact,some sequences have no existing reference data, and because the same reference data isneeded by compression and decompression, so these reference genome must be stored in thelocal, increasing the costs. The de novo ssequencing DNA data compression methods do notrely on external reference genome and the completeness is better, but is still limited by theshort read splicing technology.In view of the above problem, the research status of DNA data compression method aresurveyed in this thesis, the challenges and prospects faced by DNA data compression methodare also discussed. Finally, several kinds of DNA data compression methods are proposed.The contribution of this dissertation mainly has the following several aspects:(1) A compression algorithm called DNAEC is proposed for DNA data compression, inwhich the extended operations are used. The three edit operations are extended to eightoperations so that the DNA data can form the three match patterns (exact match,complementary palindrome match and self repeat match), and then can be well compressed bythe idea of LZ, encoding the unmatched fragments with the order-2arithmetic code. Simulation results show better compression performance of DNAEC than typical DNA datacompression algorithms on benchmark DNA sequences, especially for longer DNA data.(2) On memetic algorithm framework, a DNA data compression method based oncollaborative particle swarm optimization-based memetic Algorithm (CPMA) is proposed, inwhich the comprehensive learning particle swarm optimization (CLPSO) is used for globalsearch and a dynamic adjustive chaotic search operator (DACSO) as the local searchrespectively. In CPMA, it looks for the global optimal code book based on extendedapproximate repeat vector (EARV), by which the DNA data is compressed. Experimentalresults demonstrate better performance of CPMA than the other optimization algorithms, andit is very close to the global optimal points in most of the test functions adopted by the thesis.The compression performance of the method based on CPMA is markedly improvedcompared to many of the classical DNA sequence compression algorithms on benchmarkDNA sequences.(3) A novel hybrid particle swarm optimization based memetic algorithm (HPMA) isproposed for DNA data compression, in which dynamic comprehensive learning particleswarm optimization (DCLPSO) method within the frameworkl of the memetic algorithm isused for global search and two adaptive local searches (LS) operators including centersymmetry mutation differential evolution (CSMDE) operator and adaptive chaotic searchoperator (ACSO) work in cooperative way. HPMA looks for the global optimal code bookbased on EARV, by which the DNA sequence will be compressed. Experiments wereconducted on19high-dimensional functions and11real DNA sequences. The results showthat HPMA is more competitive in both the performance and scability, and also attains bettercompression ability than other representative DNA specific algorithms on DNA sequencedata.(4) Considering the DNA data compression algorithm based on the reference genome,the dependence have been too strong. A novel high-throughput DNA data compressionmethod based on codebook index transformation (CITD) is proposed, in which the codebookindex transformation (CIT) model is adopted to substitute the traditional representation ofcodebook indexes by the quaternary values which are expressed by the four standard basecharacters, and adopts a simple encoding method to distinguish the replaced and non-replacedsubstring, and subsequently determines whether need to use the burrows-wheelertransformation (BWT) according to the value of information entropy, finally uses move tofront (MTF) transformation and Huffman entropy coding to compress the data. Experimentalresults on several sequencing data sets demonstrate better performance of CITD than the high-throughput DNA data compression algorithms cited in this thesis, in most cases.(5) A compression algorithm, called DNAC-K, is proposed, which is based on K-meansclustering for high-through DNA data and creates cluster of sequences based on K-meansclustering method at first, then the clusters are iterated according to the edit distances ofsubsequences, and finally, Huffman coding is adopted to encode the clustering result.Experimental results on several sequencing data sets demonstrate better performance ofDNAC-K than many of the previous high-throughput DNA data compression algorithms.

Keywords/Search Tags:

DNA data compression, extended operations, EARV, daptive chaotic searchoperator, memetic algorithm, codebook index transformation model, k-means clustering

PDF Full Text Request

Related items

1	Adaptive Memetic Algorithm For Data Clustering
2	Study Of A Image Retrieval System Based On Clustering Index
3	Adaptive Niching Memetic Algorithm Based Data Clustering
4	Research On Cluster Center Optimization Of K-means Algorithm
5	Hybrid Memetic Algorithms And Their Change Detection Applications
6	Research Of Clustering Algorithms
7	Improvement And Empirical Study Of K-means Ciustering Algoirthm On Panel Data Analysis
8	Research And Application Of K-means Clustering Algorithm
9	Optimization Method Of Multi-distribution Centers Location Based On K-means Clustering Algorithm And Evidential Reasoning Approach
10	Study Of Chinese Text Clustering On Improved K-means Algorithm