Font Size: a A A

Approximate Text Alignment Techniques And Optimization Approaches

Posted on:2013-05-17Degree:MasterType:Thesis
Country:ChinaCandidate:H L LiuFull Text:PDF
GTID:2298330467976372Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Currently, most of information on the Internet is stored as text, so techniques about information retrieval, such as approximately substring query, are requiring more attention. Meanwhile, with the development of sequencing technology, it is getting a lot easier to assemble a person’s gene. The development of sequencing technology has thrown a big challenge for gene data analysis techniques such as local alignment tecniques. Acually, both the information retrieval problem and the gene analysis problem are using text alignment techniques as their core part, so we show how to improve the text alignment technique to solve problems of approximately substring search and loal alignment more efficiently and effectively.Currently, there already exist some classical algorithms about substring search, such as the BM (Boyer-Moore) algorithm and the KMP (Knuth-Morris-Pratt) algorithm, but they are focusing on exactly substring search problem. Besides, current approximatly string search algorithms usually have a given edit distance threshold k, and then report those strings whose edit distances with the query are not bigger than k. However, when talking about approximate substring search, many answers got in this way have overlap, thus meaningless. So, it is both important and challenging to improve the text alignment techniques to solve this problem in an efficient way. For local alignment problem, there are some very classical tools that could solve this probem. However, they are all suffering some flaws in efficiency or accuracy. For example, the SW (Smith-Waterman) algorithm can guarantee to find all results but is both time and space intensive that could not be used in practice. BLAST (Basic Local Alignment Search Tool) is a very classical and currently very popular tool, but it can not guarantee to find all results though very efficient. A recently proposed software called BWT-SW can guarantee to find all results and very efficient, however, it is still a lot slower than BLAST. So, how to improve the efficiency of local alignment algorithms without losing their accuracy is very challenging. For the problem of approximate substring search, we first propose a concept of local optimal matching to eliminate overlaped results. Then we analyze the generality of the matching process, and propose methods of filtering and limiting the boundary based on local optimal matching. Finally, with the algorithm of approximate substring search supporting local optimal matching based on gram index, we propose an algorithm of local optimal approximate substring search with filtering. For the problem of local alignment, we propose a family of filtering and reusing techniques based on dynamic programming. We first analyze the process of dynamic programming, and then propose a family of filtering techniques to avoid meaningless calculations and reusing techniques to avoid duplicate calculations. Finally, we propose the ALAE (Accelerating Local Alignment with Affine Gap Exactly) algorithm, which could not only guarantee to find all results but also holds a very high efficiency. Finally, experiments over real data demonstrate the efficiency of our algorithms.
Keywords/Search Tags:string matching, approximate substring search, gram index, BWT index, editdistance, text alignment
PDF Full Text Request
Related items