Font Size: a A A

Research On Q-gram Filters For Approximate String Matching

Posted on:2013-03-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:D C SunFull Text:PDF
GTID:1228330395485107Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the rapid development of Internet, information can be acquired from allover the world and transmitted to anywhere. Now, text already becomes the mostwidely used media, but the huge text overflows the receptivity of a man. How to findout the desired text accurately and quickly in the huge text collection is a basicproblem in Computer Science. The technique of approximate string matching is thestraightest solution, which can find out all positions that match the pattern below acertain number of errors from a long text. It is widely used in Text Retrieval,Computational Biology, Signal Processing, Intrusion Detection, etc. Thus, research onenhancing the matching speed of approximate string matching has far-reachingpractical meanings.In this dissertation, we concentrate on q-gram filter based approximate stringmatching techniques. Many new features are extracted from match-region byanalyzing the text that contains true matches. Here, four lossless filter algorithms areproposed, i.e., one global algorithm and three local algorithms. The main content ofthis dissertation is presented as follows.1) A q-gram hits features based global approximate string matching algorithm isproposed. First, the pattern and text is split into blocks with equal size, and then threenew match-region features are extracted from blocks, i.e., the even distribution ofq-gram hits, the regional distribution of valid q-gram hits and the non-effect ofredundant q-gram hits. In filtration phase, a new choosing filtration-region scheme isproposed and the filtration-region is filtrated with a new filter criterion refined by thenew match-region features. The experimental results show that the new algorithmachieves higher filtration efficiency and it performs well under different matchingerror ratio. The experimental results also demonstrate that the proposed algorithm issuited for global approximate string matching with various matching error ratios.2) A diagonal features based local approximate string matching algorithm isproposed. By analyzing the edit matrix of two strings whose edit distance is not largerthan k, three new diagonal features are extracted from edit matrix, i.e., the total validdiagonals, the total q-gram hits on diagonals and the total q-gram hits on one diagonalstrode by each edit path. In filtration phase, a new choosing filtration-region schemebased on new diagonal features is used and the filtration-region is filtrated with a new filter criterion refined by the new diagonal features. The experimental results showthat the proposed algorithm achieves highest filtration efficiency compared withQUASAR and SWIFT. The experimental results also demonstrate that the proposedalgorithm is suited for local approximate string matching with small matching errorratios and short windows.3) A two-stage filtration based local approximate string matching algorithm isproposed. By the analysis of KS1algorithm and q-gram hits filters, a new filtrationscheme based on KS1and q-gram hits is presented. There are two stages in thefiltration phase. The first stage is seed filtration that can discard the text that does notcontain any seed of query immediately, and the second stage is hits filtration that caneliminate the text that contains at least one seed but whose q-gram hits does not satisfyfilter criterions. The experimental results show that the proposed algorithm achieveshighest filtration efficiency in a short filtration time compared with others. Theexperimental results also demonstrate that the proposed algorithm is suited for localapproximate string matching with various matching error ratios and window lengths.4) A consecutive match-blocks feature based local approximate string matchingalgorithm is proposed. First, the pattern and text is split into blocks with equal size,and then a new consecutive match-blocks feature is extracted from blocks, i.e., there isat least one z consecutive match-blocks in the text that contains true matches. Infiltration phase, the new feature is used to choose filtration-region and thefiltration-region is filtrated with the basic q-gram lemma. The experimental resultsshow that the proposed algorithm achieves higher filtration efficiency in an evenshorter filtration time compared with others. The experimental results alsodemonstrate that the proposed algorithm is suited for local approximate stringmatching with various matching error ratios and window lengths.
Keywords/Search Tags:Approximate string matching, global alignment, local alignment, filteralgorithm, q-gram filter, q-gram index, match-region feature, losslessfilter
PDF Full Text Request
Related items