Research On The Index For Similar Search Based On Hamming Distance

Posted on:2019-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2428330545965810

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

Similar search is an important application in the fields of information retrieval,database application and pattern recognition.With the growing of network information,the form of data representation is also more abundant.How to retrieve target information quickly and efficiently in huge volume of data has always been an important research subject in the field of information retrieval.It is a challenging and hot issue that how to establish an efficient and convenient index to return the correct information in time.This paper mainly studies the issue of similar search in hamming space,given a dataset D and a query string Q,all strings which are similar to Q are returned in as little time as possible.This problem is called the approximate dictionary queries problem.The approximate dictionary queries problem can be divided into two phases to solve:1)Search phase:splitting the key in D and building indexes to find potential similar target candidates.2)Check phase:using effective strategies to quickly check these candidates and filtering out the results that truly satisfied the query conditions.This paper studies Search phase and Check phase and the main work is as follows:(1)Firstly,the Simhash is used to preprocess the datasets.With the operation of extracting,weighting,merging and dimensionality reduction,the high-dimensional data is processed into the Simhash fingerprint.(2)An improved multi-index algorithm based on hamming distance is proposed which is mainly used to find out the possible result in candidates.The idea of spliting key comes from previous multi-index algorithm in hamming space.Firstly,spliting the binary fingerprint into b blocks.The improved multi-index algorithm creates indexs with blocks and the structure of index based on the relationship between k and b.What's more,the efficiency of the query is better.(3)Two Check algorithms based on central point are proposed.The algorithm combines the idea of cluster,and they are used for filtering out the errors from the candidates.The greedy Check algorithm based on the center point is proposed,the center point P which are selected through greedy strategy produced a cluster at a time.The Check algorithm based on the center point is compared with the linear scanning method.The greedy Check algorithm based on central point is more efficient.

Keywords/Search Tags:

PDF Full Text Request

Related items

1	Research And Improvement Of Text Similarity Detection Based On Simhash
2	A Fast Text Elimination Algorithm Based On Simhash
3	Research On Theories And Applications Of Similar Image Retrieval System Based On Improved Multi-index Hashing
4	Research On Memristor-based Hamming Distance Calculation Methods And Realization
5	Secure Ranked Search Based On Simhash Over Enervated Data
6	Fast Similar Image Search In Large Scale Image Database
7	Research On Text Similarity Detection Algorithm Based On Simhash
8	Research On Subgraph Search Technology In Graph Database
9	Similar News Recommendation Based On Simhash And CNN
10	Research On PUF Circuit Design Based On Memristor Array