Font Size: a A A

Procedural enhancements to some approximate searching techniques

Posted on:2009-10-07Degree:Ph.DType:Dissertation
University:The George Washington UniversityCandidate:Strauss, Michael JFull Text:PDF
GTID:1448390002495920Subject:Computer Science
Abstract/Summary:
The ever-expanding use and ease of access to vast stores of data on the Internet demands efficient searching techniques and, in particular, fuzzy or approximate searching. Approximate searching often employs the linear examination of each data item under consideration to determine if the item is sufficiently close to a desired target to be included in a set of items responsive to a search request. Linear searching techniques are too inefficient to effectively search large data repositories. The present work describes a technique for directly accessing close matches by calculating indices grouping similar items and improvements to the basic technique that further enhances performance.; Approximate, inexact or fuzzy searching and matching to find sequences that are similar have long been implemented for string matching, further having applications in information retrieval, spellchecking and computational biology. Recent applications suggest using fixed-length vectors to characterize a wider range of materials and documents by associating each item with a binary attribute vector, each bit of a vector indicating whether a particular attribute is absent or present within the item. The present work continues prior efforts investigating characteristics and functionalities provided by the (23, 12, 7) Golay Code used to hash 23-bit binary attribute vectors to form 12-bit hash indices. Approximate matching is implemented by identifying the 1-bit distortions of each 23-bit attribute vector which, together with the undistorted vector, hash in quadruples to a compacted set of six distinct hash indices and corresponding Golay coding spheres.; We address two significant implementation problems that interfere with a practical implementation of an approximate matching scheme using the Golay code: the resultant hash codes are too small and the efficiency of the bit counting used to compare potentially matching attribute vectors limits search performance. To address the latter, we suggest implementing an improved vertical counting scheme including bundling at an input to the counter to simultaneously handle multiple words. The vertical counting technique implements the equivalent of a word-wide array of hardware bit adders to form a number of parallel counters, each having as an input one bit from an input word. Each stage of the vertical counter comprises a word of bits having a binary weighting corresponding to the position of the stage. Using bit parallel logical operations, the proposed technique combines or "bundles" multiple words to appropriately update the initial stages of the vertical counter. The 12-bit hash code limitation imposed by the Golay code is addressed by pairing of indices to greatly expand the address space and hash table size. Pairing further provides other advantages. Since all 2-bit distortions can be found in at least 2 of the nearest 6 coding spheres (the original and 5 nearest in the code space), pairing of the indices provides a set of 15 concatenated index pairs to support retrieval of all 2-bit distortions without duplication. Whereas there are 253 possible 2-bit distortions, all can be found in only fifteen locations using the present technique. The enhanced efficiency provides the capability of real time identification of approximate matches. Thus the technique can be applied to provide real time message filtering and monitoring, data transmission, etc. Finally, other uses for this technique are explored including classification of items to identify logical clusters, define meaningful partitions and uncover hidden information.
Keywords/Search Tags:Technique, Searching, Approximate, Data, Item
Related items