Research On Approximate Query Processing Techniques Supporting Scoring Matrices

Posted on:2012-10-14

Degree:Master

Type:Thesis

Country:China

Candidate:W J Sun

Full Text:PDF

GTID:2248330395957655

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

In recent years, with the fast development of bioinFormatics, a lot of biological data has been generated. Research and analysis of these biological data is significant For guiding the research on life science and revealing the law of origin and evolution of life. Biological sequence and biological motif are two important elements of these biological data. When people want to classify a new biological sequence with its motif inFormation it implies or find the occurrences of a certain biological motif in a biological sequence, they need to match the biological sequence with the biological motif.Initially, biological motifs are usually represented using fixed strings, as a result, the matching problem of sequences and motifs in bioinFormatics field is solved mainly by the technique of sequence similarity analysis, containing the exact query processing and approximate query processing. However, not all biological motifs can be properly represented using fixed strings and many biological motifs are kept in the Form of scoring matrices. With the fast expanding of biological motifs represented by scoring matrices and the development of databases storing these motifs, corresponding new approximate query processing methods which can solve the matching problem of sequences and this kind of motifs are needed. And this is also the problem what we want to solve in our work.According to the condition that the query submitted in a query processing can either be a sequence or motifs represented with scoring matrices, we distinguish the approximate query processing which supports scoring matrices into on-line and off-line situations. In the case of on-line, we summarize the filtrating principles during the query processing, analyze two possible optimization strategies, and propose a look-ahead filtrating algorithm supporting multiple scoring matrices, combining the two optimization strategies. This algorithm first preprocesses all of the input scoring matrices to construct a global filtrating automation, then scans the sequence with the automation to find the match of input matrices in the sequenceã€‚ In the case of off-line, we propose a query processing algorithm of multiple scoring matrices based on the suffix tree index. This algorithm preprocesses all of the input scoring matrices in a query and then determines the matches of all these matrices with the sequence through a depth-first traversal of the sequence. The results of experiments and tests on the real nucleotide sequence and the database whose motifs are represented with scoring matrices show that the algorithms we propose have improved the efficiency of query processing, comparing with Former algorithms.

Keywords/Search Tags:

Scoring matrices, approximate query processing, biological sequence, filtrate, suffix tree

PDF Full Text Request

Related items

1	Approximate Longest Common Subsequence Query Processing And Optimization On Biological Sequence
2	Research And Implementation Of Compress Storage Of Biological Sequences And Indices
3	Research And Implementation Of Index Structure Of Biological Sequence
4	Research On Construction Of Index Structure For Biological Sequences
5	The Research And Implementation Of Biological Sequence Alignment
6	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
7	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
8	Research Of Approximate Query Processing Technology For Large Scale Data
9	Algorithms For Top-k Document Retrieval
10	Finding MUMs With Enhanced Suffix Arrays