Font Size: a A A

Research On High-dimensional Index In Large-scale Image Retrieval

Posted on:2016-07-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y P MaFull Text:PDF
GTID:1108330473456373Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Given a query image, large-scale image retrieval (LSIR) is to find its similar im-ages from large-scale database with high accuracy and efficiency. LSIR plays an im-portant role in many applications, such as multimedia retrieval, copyright infringe-ment detection, and network information monitoring.LSIR systems utilize visual feature extraction techniques to describe visual con-tents of images as high-dimensional digital vectors, and the image retrieval problem is converted into the similarity measure problem of high-dimensional data. In network LSIR systems, the dimensionality of descriptors is generally larger than one hundred and the quantity of vectors is more than ten million. So, high-dimensional index tech-niques are the key for the performance of LSIR. The performance of traditional tree-based high-dimensional index degrades severely as the the curse of dimensional-ity problem, and memory resource is the bottleneck in face of large-scale data. So it is a challenging problem to build effective high-dimensional index for large-scale da-taset to meet the need of high search performance and low memory cost.In order to achieve efficient, effective and low memory consumption high-dimensional index for large-scale datasets, this thesis studies three important problems, including distributed locality sensitive hashing, data-oriented multi-index hashing, and binary hierarchical index technology. The main contributions of the the-sis are summarized as follows:(1) Distributed Locality Sensitive HashingLocality sensitive hashing (LSH) is the popular algorithm for approximate nearest neighbor (ANN) search. As LSH partitions vector space uniformly and the distribu-tion of vectors is usually non-uniform, it poorly fits real dataset and has limited effi-ciency. In this thesis, we first propose a new data-dependent LSH (DP-LSH) algo-rithm, which has two-level structures. In the first level, we train a number of cluster centers, and use the cluster centers to divide the dataset into many clusters and the vectors in each cluster has near uniform distribution. In the second level, we construct LSH tables for each cluster. Given a query, we first determine a few clusters that it belongs to with high probability, and then perform ANN search in the corresponding LSH tables. Furthermore, we present an optimized distributed scheme and propose a distributed DP-LSH algorithm. Experimental results on the reference datasets show that the search speed of DP-LSH can be increased by 48 times compared to E2LSH, while keeping high search precision; and the distributed DP-LSH can further improve search efficiency.(2) Data-Oriented Multi-Index HashingMulti-index hashing (MIH) is the state-of-the-art exact nearest neighbor search method for indexing binary codes. However, MIH is based on the dataset codes uni-form distribution assumption, and will lose efficiency in dealing with non-uniformly distributed codes. Besides, there are lots of results sharing the same Hamming dis-tance to a query, which makes the distance measure ambiguous. In this thesis, we propose a data-oriented multi-index hashing method (DOMIH). We first compute the covariance matrix of bits and learn adaptive projection vector for each binary sub-string. Instead of using substrings as direct indices into hash tables, we project them with corresponding projection vectors to generate new indices. With adaptive projec-tion, the indices in each hash table are near uniformly distributed. Then with covari-ance matrix, we propose a ranking method for the binary codes. By assigning different bit-level weights to different bits, the returned binary codes are ranked at a fin-er-grained binary code level. Experiments conducted on reference large scale datasets show that compared to MIH the time performance of DOMIH can be improved by 36.9%-87.4%, and the search accuracy can be improved by 22.2%.(3) Binary Hierarchical IndexTo further improve search efficiency of binary codes, attention has been paid to perform approximate nearest neighbor search, in which Hierarchical Clustering Trees (HCT) is a widely used method. However, HCT selects cluster centers randomly and builds index with the whole binary codes, which degrades search performance. In this thesis, we first propose a new clustering algorithm, which chooses cluster centers on the basis of relative distances among them and has more homogeneous partition of the dataset than HCT, to build the hierarchical clustering trees. Then, we present an algo-rithm to compress binary codes by extracting distinctive bits according to the standard deviation of each bit. Consequently, a new index is proposed using compressed binary codes based on hierarchical decomposition of binary spaces. Experiments conducted on reference datasets and a dataset of one billion binary codes demonstrate the good performance of our method.As a conclusion, by addressing the limitation of existing high-dimensional index technologies, this thesis studies three key problems to improve the performance of index technologies in dealing with large-scale datasets. Our studies can provide solu-tions to better manage and utilize the large corpus of LSIR.
Keywords/Search Tags:High-Dimensional Index, Distributed Locality Sensitive Hashing, Data-Oriented Multi-Index Hashing, Binary Hierarchical Index
PDF Full Text Request
Related items