Font Size: a A A

Research On Construction Of Index Structure For Biological Sequences

Posted on:2010-09-21Degree:MasterType:Thesis
Country:ChinaCandidate:Y HuangFull Text:PDF
GTID:2178330332988643Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Bioinformatics is the science of using computer technology to store, retrieve and analyze biological information in the field of life sciences. To develop rapid and effective computer algorithm to find knowledge from very large biological data is the main research work.This thesis mainly focuses on the study of suffix tree and suffix array index technical dealing with bio-sequences. First, index structure of bio-sequences is described, then several algorithms of indes structure construction based on the suffix array and suffix tree are introduced and analysized. Suffix tree is a good index structure for smaller sequences, but it not suit large sequences, due to the so-called "memory bottleneck". The suffix array is the closest competing structure, as it needs less space than a suffix tree. However, it is not convenient for searching. So, a top-down method of construction for suffix trees step by step is presented. All the suffixes of a string are firstly sorted according to lexicographical order. Then, the longest common prefixes of all pairs of adjacent suffixes on the suffix array between them are computed. Suffix trees are finally constructed based on the suffix order and the longest common prefixes of all pairs of adjacent suffixes. This method abandons the suffix-links so as to overcome so-called "memory bottleneck". The time complexity of the method is O(n).
Keywords/Search Tags:suffix tree, suffix sorting, suffix array, longest common prefix, top-down
PDF Full Text Request
Related items