Font Size: a A A

Research On The Index For Similar Search Based On Hamming Distance

Posted on:2019-12-26Degree:MasterType:Thesis
Country:ChinaCandidate:Y ZhangFull Text:PDF
GTID:2428330545965810Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Similar search is an important application in the fields of information retrieval,database application and pattern recognition.With the growing of network information,the form of data representation is also more abundant.How to retrieve target information quickly and efficiently in huge volume of data has always been an important research subject in the field of information retrieval.It is a challenging and hot issue that how to establish an efficient and convenient index to return the correct information in time.This paper mainly studies the issue of similar search in hamming space,given a dataset D and a query string Q,all strings which are similar to Q are returned in as little time as possible.This problem is called the approximate dictionary queries problem.The approximate dictionary queries problem can be divided into two phases to solve:1)Search phase:splitting the key in D and building indexes to find potential similar target candidates.2)Check phase:using effective strategies to quickly check these candidates and filtering out the results that truly satisfied the query conditions.This paper studies Search phase and Check phase and the main work is as follows:(1)Firstly,the Simhash is used to preprocess the datasets.With the operation of extracting,weighting,merging and dimensionality reduction,the high-dimensional data is processed into the Simhash fingerprint.(2)An improved multi-index algorithm based on hamming distance is proposed which is mainly used to find out the possible result in candidates.The idea of spliting key comes from previous multi-index algorithm in hamming space.Firstly,spliting the binary fingerprint into b blocks.The improved multi-index algorithm creates indexs with blocks and the structure of index based on the relationship between k and b.What's more,the efficiency of the query is better.(3)Two Check algorithms based on central point are proposed.The algorithm combines the idea of cluster,and they are used for filtering out the errors from the candidates.The greedy Check algorithm based on the center point is proposed,the center point P which are selected through greedy strategy produced a cluster at a time.The Check algorithm based on the center point is compared with the linear scanning method.The greedy Check algorithm based on central point is more efficient.
Keywords/Search Tags:Similar search, multi-index, Hamming distance, Simhash
PDF Full Text Request
Related items