Font Size: a A A

Research On Approximate Query Processing Techniques Supporting Scoring Matrices

Posted on:2012-10-14Degree:MasterType:Thesis
Country:ChinaCandidate:W J SunFull Text:PDF
GTID:2248330395957655Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In recent years, with the fast development of bioinFormatics, a lot of biological data has been generated. Research and analysis of these biological data is significant For guiding the research on life science and revealing the law of origin and evolution of life. Biological sequence and biological motif are two important elements of these biological data. When people want to classify a new biological sequence with its motif inFormation it implies or find the occurrences of a certain biological motif in a biological sequence, they need to match the biological sequence with the biological motif.Initially, biological motifs are usually represented using fixed strings, as a result, the matching problem of sequences and motifs in bioinFormatics field is solved mainly by the technique of sequence similarity analysis, containing the exact query processing and approximate query processing. However, not all biological motifs can be properly represented using fixed strings and many biological motifs are kept in the Form of scoring matrices. With the fast expanding of biological motifs represented by scoring matrices and the development of databases storing these motifs, corresponding new approximate query processing methods which can solve the matching problem of sequences and this kind of motifs are needed. And this is also the problem what we want to solve in our work.According to the condition that the query submitted in a query processing can either be a sequence or motifs represented with scoring matrices, we distinguish the approximate query processing which supports scoring matrices into on-line and off-line situations. In the case of on-line, we summarize the filtrating principles during the query processing, analyze two possible optimization strategies, and propose a look-ahead filtrating algorithm supporting multiple scoring matrices, combining the two optimization strategies. This algorithm first preprocesses all of the input scoring matrices to construct a global filtrating automation, then scans the sequence with the automation to find the match of input matrices in the sequence。 In the case of off-line, we propose a query processing algorithm of multiple scoring matrices based on the suffix tree index. This algorithm preprocesses all of the input scoring matrices in a query and then determines the matches of all these matrices with the sequence through a depth-first traversal of the sequence. The results of experiments and tests on the real nucleotide sequence and the database whose motifs are represented with scoring matrices show that the algorithms we propose have improved the efficiency of query processing, comparing with Former algorithms.
Keywords/Search Tags:Scoring matrices, approximate query processing, biological sequence, filtrate, suffix tree
PDF Full Text Request
Related items