Font Size: a A A

Similarity search for large-scale image datasets

Posted on:2007-10-30Degree:Ph.DType:Thesis
University:Princeton UniversityCandidate:Lv, QinFull Text:PDF
GTID:2458390005486296Subject:Computer Science
Abstract/Summary:
Content-based image similarity search is a difficult problem due to the high dimensionality and usually massive amount of image data. The main challenge is to achieve high-quality similarity search with high speed and low space usage. This thesis proposes several techniques to address the problem of building a similarity search system for large-scale image datasets. A prototype image search system, called CASS-Image (Content-Aware Search System for Images), has been implemented to demonstrate the effectiveness of these techniques.; The first contribution of this thesis is a sketch construction algorithm that converts high-dimensional feature vectors into bit vectors (sketches), such that the weighted (and thresholded) ℓ1 distance between two feature vectors can be estimated by the Hamming distance of their sketches. Experimental results show that using sketches can typically reduce the space requirement by an order of magnitude with minimal impact on similarity search quality.; The second is a hash-perturbation based LSH (Locality Sensitive Hashing) technique for approximate nearest neighbor search in high dimensions. This technique probes multiple buckets in each hash table by perturbing the hashed value of the query object. Performance evaluations show that this method is both time and space efficient. It has a similar time efficiency as the basic LSH method while reducing the space requirement by a factor of five. Also, its time efficiency is twice that of the point-perturbation based LSH method.; The third is a multi-feature filtering algorithm for region-based image similarity search. This method uses approximation algorithms to generate a candidate set, and then ranks the objects in the candidate set with a more sophisticated multi-feature distance measure. It works for both feature vectors and their sketches. It can also be combined with indexing techniques to further speed up the search process. Performance evaluations show that filtering is 4--13 times faster than the brute-force approach, while still maintaining good search quality.; This thesis also proposes a new region-based image similarity measure, EMD* match, which uses square-root region weights and region distance thresholding. Experimental results show that EMD* match is 27%--91% more effective than previous image similarity search techniques.
Keywords/Search Tags:Similarity search, Image, Techniques, Distance, Show
Related items