Research On Construction Of Index Structure For Biological Sequences

Posted on:2010-09-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y Huang

Full Text:PDF

GTID:2178330332988643

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Bioinformatics is the science of using computer technology to store, retrieve and analyze biological information in the field of life sciences. To develop rapid and effective computer algorithm to find knowledge from very large biological data is the main research work.This thesis mainly focuses on the study of suffix tree and suffix array index technical dealing with bio-sequences. First, index structure of bio-sequences is described, then several algorithms of indes structure construction based on the suffix array and suffix tree are introduced and analysized. Suffix tree is a good index structure for smaller sequences, but it not suit large sequences, due to the so-called "memory bottleneck". The suffix array is the closest competing structure, as it needs less space than a suffix tree. However, it is not convenient for searching. So, a top-down method of construction for suffix trees step by step is presented. All the suffixes of a string are firstly sorted according to lexicographical order. Then, the longest common prefixes of all pairs of adjacent suffixes on the suffix array between them are computed. Suffix trees are finally constructed based on the suffix order and the longest common prefixes of all pairs of adjacent suffixes. This method abandons the suffix-links so as to overcome so-called "memory bottleneck". The time complexity of the method is O(n).

Keywords/Search Tags:

suffix tree, suffix sorting, suffix array, longest common prefix, top-down

PDF Full Text Request

Related items

1	Finding MUMs With Enhanced Suffix Arrays
2	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
3	Research Of Finding Maximal Unique Matches In Genome
4	Approximate String Matching Based On Sufifx Array
5	A Frequent String Mining Algorithm Based On Optimized LCP Table
6	Performance Optimization Of A New External Memory Algorithm For Suffix Array Construction
7	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
8	Approximate Longest Common Substring Matching And Optimization Techniques With Edit Distance Constraint
9	Research On Vietnamese News Topic Recognition Method Based On Suffix Tree Clustering Algorithm
10	K-Prefix Tree Full-Text Search Method And Application