Font Size: a A A

Using Random Projection Technology To Find Biological Sequence Features Of The Algorithm

Posted on:2003-02-02Degree:MasterType:Thesis
Country:ChinaCandidate:W HeFull Text:PDF
GTID:2208360065955838Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus these algorithms must trade off between efficiency and sensitivity to features without long exact matches.In this paper, we first analyse two schemes to reduce the complexity of similarity search algorithms: filtering heuristic and deterministic exclusion method. The deterministic exclusion algorithms are designed to have 100% guaranteed sensitivity to similarities above a user-specified threshold, but they do not scale well to large all-pairs problems at low but practically interesting identity thresholds of 65-70%. Practical word matching is much more efficient in practice, but its sensitivity depend on the distribution of mutations in the interesting similarities in an uncontrolled way, and its efficiency or sensitivity also decays substantially for similarities around 67% identity.Basing on the analysis of above two algorithms, we propose a new algorithm, RP-ALL-PAIRS, which uses a method of randomized project to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. RP-ALL-PAIRS obtains randomized projects of DNA sequences bylocality-sensitive hash function, under which the two strings project to the same value varies directly with their degree of similarity. Adopting a randomized approach to similarity search breaks through the curse of dimensionality, achieved substantially better performance than deterministic exclusion algorithms, such as double filteration, at levels of similarity as low as 67% ungapped identity. In practice, RP-ALL-PAIRS complements word match filtering techniques for annotation because it finds biologically meaningful similarities that are difficult to detect using even relatively short word lengths.In this paper, the analysis of the performance of RP-ALL-PAIRS is given, and how to choose optimal parameters for RP-ALL-PAIRS is discussed based on the performance study. Our experiment shows that RP-ALL-PAIRS achieves a better trade-off between efficiency and sensitivity in finding features of DNA sequence.
Keywords/Search Tags:genomic DNA sequence features, data mining, randomized project, efficiency and sensitivity
PDF Full Text Request
Related items