Font Size: a A A

Research And Implementation Of Index Structure Of Biological Sequence

Posted on:2007-07-30Degree:MasterType:Thesis
Country:ChinaCandidate:R W ZhangFull Text:PDF
GTID:2178360185485600Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of biotechnology, many projects to sequence the genome of some species are well advanced or concluded. The very large number of species that are of interest to man, suggest that many new sequence will be revealed as the improved sequencing techniques are deployed. In such a case, Researchers have been investigating the utility of indexes. Suffix tree is a good index structure for smaller sequences, but it not suit large sequences, due to the so-called"memory bottleneck". The suffix array is the closest competing structure, as it needs less space than a suffix tree. However, it is not convenient for searching. Approaches based on q-gram or q-sample are fast and proven, but cannot deliver matches that have low similarity to the query.In this paper, we present a layer-partition index which can be searched efficiently, allowing us to build index in excess of RAM size. This index is based on suffix tree, and is divided into some layers. A layer usually includes some child suffix trees. We build and store each layer independently.On the construction of our index, we adopt the notion of"top-down write-only". The construction algorithm embodies three parts that are suffix sorting, logest common prefix evaluation process and child suffix tree construction. For every layer of the index, the algorithm can be implemented in linear time.The storage of index is divided into two parts. The first part is the information of each child suffix tree. Child suffix tree can be represented by several methods which require different storage space. The second part includes the index of child suffix tree, and some other information for finding the location of every child suffix tree in disk quickly.According to different storage methods of child suffix tree, we introduce several searching technology. We also explore memory management in the application of our index structure.Finally, we show experimentally that our index can be built effectively with sequences whose index size exceeds the available RAM. We also implement the storage of index and searching technology.
Keywords/Search Tags:suffix tree, sequence index, biological sequence, genome storage
PDF Full Text Request
Related items