Font Size: a A A

An Improved String Similarity Join Algorithm

Posted on:2014-05-17Degree:MasterType:Thesis
Country:ChinaCandidate:X M HeFull Text:PDF
GTID:2268330401974577Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Into the21st century, the rapid development of information technology, the proliferation of computers and the wide application of the Internet, the global information sharing widening the explosive growth of the amount of information. In front of a lot of information, how accurately and quickly find the information we need. This computer science research has been a hot issue. The string similarity connection technology is a more effective way to solve the above problems, it can be queried from a collection of strings to identify the degree of similarity with the query string given all the strings that meet certain requirements. The string similarity connection technology in the areas of information retrieval, academic misconduct detection systems, bioinformatics, intrusion detection, spam filtering has great use of space. Traditional methods of facing the challenges of slow queries. Therefore, this study has a certain significance to improve query string similar connection speed.This paper studies based on the string edit distance similar connection technology. In this paper, two approaches to improve the query speed.First, based on a trie inverted index, trie tree is a tree structure, you can take advantage of the common prefix string index lookup speed. The trie tree inverted index to achieve not only be able to provide the query speed of existing algorithms but also in the more common prefix before the string has better efficiency.Second, left to fill algorithm and length filtering algorithm two strategies combined approach to optimize the existing algorithms. Left to fill algorithm can improve the number of public string fragment to a certain extent, be query string to be able to use the index merge algorithm to filter, to reduce the number of strings that need to verify that the verification phase, improves the query speed. Length filtering algorithm is simple and effective for its filter criteria and also reduce the validation phase needs to verify the number of strings that can filter out some not result string. Index merge algorithm and the length of the filter algorithm combined can further improve the query speed.
Keywords/Search Tags:string similarity joins, q-gram, inverted index, edit distance
PDF Full Text Request
Related items