Using Random Projection Technology To Find Biological Sequence Features Of The Algorithm

Posted on:2003-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:W He

Full Text:PDF

GTID:2208360065955838

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus these algorithms must trade off between efficiency and sensitivity to features without long exact matches.In this paper, we first analyse two schemes to reduce the complexity of similarity search algorithms: filtering heuristic and deterministic exclusion method. The deterministic exclusion algorithms are designed to have 100% guaranteed sensitivity to similarities above a user-specified threshold, but they do not scale well to large all-pairs problems at low but practically interesting identity thresholds of 65-70%. Practical word matching is much more efficient in practice, but its sensitivity depend on the distribution of mutations in the interesting similarities in an uncontrolled way, and its efficiency or sensitivity also decays substantially for similarities around 67% identity.Basing on the analysis of above two algorithms, we propose a new algorithm, RP-ALL-PAIRS, which uses a method of randomized project to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. RP-ALL-PAIRS obtains randomized projects of DNA sequences bylocality-sensitive hash function, under which the two strings project to the same value varies directly with their degree of similarity. Adopting a randomized approach to similarity search breaks through the curse of dimensionality, achieved substantially better performance than deterministic exclusion algorithms, such as double filteration, at levels of similarity as low as 67% ungapped identity. In practice, RP-ALL-PAIRS complements word match filtering techniques for annotation because it finds biologically meaningful similarities that are difficult to detect using even relatively short word lengths.In this paper, the analysis of the performance of RP-ALL-PAIRS is given, and how to choose optimal parameters for RP-ALL-PAIRS is discussed based on the performance study. Our experiment shows that RP-ALL-PAIRS achieves a better trade-off between efficiency and sensitivity in finding features of DNA sequence.

Keywords/Search Tags:

genomic DNA sequence features, data mining, randomized project, efficiency and sensitivity

PDF Full Text Request

Related items

1	Identifying functional lox sequences: A genomic search and randomized libraries
2	Genomic data mining enhanced by symbolic manipulation of Boolean functions
3	Genomic data analysis and processing with signal processing techniques
4	Compressed Index And Query Optimization On Large-Scale Genomic Data
5	The Maximal Frequent Pattern Mining Of Dna Sequence
6	Industrial Applications of Data Mining Engineering Effort Forecasting based on Mining and Analysis of Patterns in Historical Project Execution Data
7	Data Mining for Discovery of Clinical and Genomic Disease Markers
8	Research And Design On Privacy-Preserving Data Mining Approaches And Algorithms
9	Construction Of The Genomic Promoter Sequence Database And Web Application System
10	How To Optimize B Company Project Schedule Efficiency For Bank Data Center Project