Research And Implementation Of Compress Storage Of Biological Sequences And Indices

Posted on:2008-09-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y R Zheng

Full Text:PDF

GTID:2178360245498170

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances. Recent advances in sequencing technology have allowed the rapid accumulation of DNA and protein data. A huge amount of biosequences have been and are being generated in laboratories all over the world. It is well known that their size increases nowadays exponentially. Therefore, the design of effective indices and compression of genetic information constitute a very important task.There are large-scale biological sequence informations stored in biological database. It is well known that many essential genes (like rRNAs) have many copies and repeat substring. It also has been conjectured that genes duplicate themselves for evolutionary or simply for"selfish"purposes. All this evidence gives more concrete support that the DNA sequences should be reasonably compressible. This paper presents a structure parallel local optimization compression algorithm which divided biological sequence data into multiple processors. In a single processor, it uses generalized suffix tree to identify redundant sequences substring. With this method, we can significantly increase the compression speed. Bases on the sequence divided technology, the paper also introduces a structure parallel search operation.Indexing technology accelerates the pace of processing sequences, the existing technology, includes suffix tree, suffix arrays, q-gram and q-sample, and so on. The processing speed of suffix tree is fast, but it not suit large sequences, due to the so-called"memory bottleneck". In this paper, we presents a compressed layer-partitioned index. The construction of the layer-partitioned index adopts the notion of"top-down write-only". It is divided into several layers, it is based on suffix tree, and is divided into several layers, each layer can be built and store independently. This avoids the"memory bottleneck"problems. It uses storage space effectively with while guarantees the efficiency of establishment and search operations. Then we presented the search operations on this new index.Experiments show that with the use of this sequence compression method, the compression efficiency and processing time has made marked improvement. The storage space demand of layer-partitioned index has been greater ease, while its search operation efficiency has not been significantly affected.

Keywords/Search Tags:

compression, parallel processing, biological sequence, suffix tree, layer-partitioned index

PDF Full Text Request

Related items

1	Research And Implementation Of Index Structure Of Biological Sequence
2	Research On Construction Of Index Structure For Biological Sequences
3	The Research And Implementation Of Biological Sequence Alignment
4	Research On Approximate Query Processing Techniques Supporting Scoring Matrices
5	Research And Application Of Index On Biological Sequences
6	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
7	Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Generalized Suffix Tree Construction
8	Research On Temporal XML Index Based On Suffix Tree
9	Algorithms For Top-k Document Retrieval
10	Finding MUMs With Enhanced Suffix Arrays