Font Size: a A A

Research And Implementation Of Compress Storage Of Biological Sequences And Indices

Posted on:2008-09-06Degree:MasterType:Thesis
Country:ChinaCandidate:Y R ZhengFull Text:PDF
GTID:2178360245498170Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances. Recent advances in sequencing technology have allowed the rapid accumulation of DNA and protein data. A huge amount of biosequences have been and are being generated in laboratories all over the world. It is well known that their size increases nowadays exponentially. Therefore, the design of effective indices and compression of genetic information constitute a very important task.There are large-scale biological sequence informations stored in biological database. It is well known that many essential genes (like rRNAs) have many copies and repeat substring. It also has been conjectured that genes duplicate themselves for evolutionary or simply for"selfish"purposes. All this evidence gives more concrete support that the DNA sequences should be reasonably compressible. This paper presents a structure parallel local optimization compression algorithm which divided biological sequence data into multiple processors. In a single processor, it uses generalized suffix tree to identify redundant sequences substring. With this method, we can significantly increase the compression speed. Bases on the sequence divided technology, the paper also introduces a structure parallel search operation.Indexing technology accelerates the pace of processing sequences, the existing technology, includes suffix tree, suffix arrays, q-gram and q-sample, and so on. The processing speed of suffix tree is fast, but it not suit large sequences, due to the so-called"memory bottleneck". In this paper, we presents a compressed layer-partitioned index. The construction of the layer-partitioned index adopts the notion of"top-down write-only". It is divided into several layers, it is based on suffix tree, and is divided into several layers, each layer can be built and store independently. This avoids the"memory bottleneck"problems. It uses storage space effectively with while guarantees the efficiency of establishment and search operations. Then we presented the search operations on this new index.Experiments show that with the use of this sequence compression method, the compression efficiency and processing time has made marked improvement. The storage space demand of layer-partitioned index has been greater ease, while its search operation efficiency has not been significantly affected.
Keywords/Search Tags:compression, parallel processing, biological sequence, suffix tree, layer-partitioned index
PDF Full Text Request
Related items