Font Size: a A A

Approximate String Matching For Chinese Characters By Combining Filtering And Bit-parallelism

Posted on:2007-07-03Degree:MasterType:Thesis
Country:ChinaCandidate:L X FanFull Text:PDF
GTID:2178360182466663Subject:Computer applications
Abstract/Summary:PDF Full Text Request
String matching is one of the basic problems in computer science. Researches are focus on exact string matching in early time, and many algorithms of single and multiple string matching have been presented.However, the number of applications for approximate string matching grows every day. We have found solutions to the most diverse problems based on approximate string matching, for instance, signal processing, text retrieval,computational biology, virus and intrusion detection, pattern recognition, optical character recognition, and so on. It has very important theoretical value and practical meaning to research and design very fast approximate string matching algorithms.Approximate string matching is an extension of the string matching problem which can been simply described as the type of allowing errors(differences) between pattern and substring of the text.Multiple approximate string matching is also a natural extension of the approximate string matching problem which allowing errors between multiple patterns and substring of the text.For the problem of approximate string matching, though a number of algorithms have been proposed, there are few studies which focus on large size of alphabet Σ. Most of experts are interested in small or middle size of alphabet Σ, For large size of Σ, especially for Chinese characters, there are few efficient algorithms. This is also the same status with the multiple patterns matching problem. Until now, we have not find any efficient algorithm for searching of multiple patterns based on large size of alphabet E.For these reasons, the work of this thesis is organized as follows.In chapters 1 and 2, we show the motivation, the target of this work. We state the basic theories of approximate string matching, and also introduce main approaches which include dynamic programming, automata, bit-parallelism, and filtering.In chapters 3, we analyze some important algorithms which will be compared by or used in our new algorithms.Chapters 4 and 5 cover all impotant works about this thesis. In charpter 4 we give an improved algorithm IBPM-BM which achieve good performance closed to BPM-BM in practice, and obtain better worst case time. In charpter 5 we present a fast algorithm MBPM-BM which can be used for searching of multiple patterns,especially in matching of Chinese characters. The most important of these two algorithms is bit-parallelism technology in filteration phase.We show that our algorithms are not only theoretically appealing but also good in practice. Our experimental results show that our two algorithms work well in practice, and achieve excellent performce especially in matching of Chinese characters.In the chapters 6, we conclude the work and discuss future work.
Keywords/Search Tags:Approximate string matching, Dynamic programming, Automata, Bit-parallelism, Filtering, Edit distance, Chinese string matching, Multiple string matching
PDF Full Text Request
Related items