Font Size: a A A

Research On Large Scale Multimedia Data Retrieval Based On Hashing Algorithm

Posted on:2016-09-29Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z M JinFull Text:PDF
GTID:1108330470967834Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Nearest neighbor search is a fundamental problem in machine learning community, which has been applied in various fields, including information retrieval, data mining and computer vision. For example, nearest neighbor search can be used in image search, face recognition and k-means clustering. Recently, large scale multimedia data has appeared with the growth of internet. Generally, the features extracted from these datum are high dimensional and dense. Thus, how to conduct nearest neighbor search efficiently in the large scale multimedia dataset is considered as a very important research topic in the academia and industry. Currently, nearest neighbor search has two main directions, tree based method and hashing based method. Many tree based methods partition the feature space in the manner of tree, its time complexity and space complexity exponentially grow with the increase of dimension d. In the contrary, hashing algorithm does not have this limitation. It generates hash functions (preserving similarity, i.e. if two points are similar in the original feature space, then they have small hamming distance in the hamming space, and vice versa.) and utilizes them to compress a d dimensional feature into a c-bits (generally, c<<d) binary code, then performs search by using hamming ranking or hash table. It is obvious that one can use hardware (XOR) to accelerate the procedure of hamming ranking, and the time complexity of hash table based search is O(1) (is not related with d). Thanks to the advantages of high computational efficiency and insensitivity of dimension, hashing algorithm attracts interest from many researchers. This thesis is mainly about the problem of hashing based nearest neighbor search for large scale multimedia data.(1) Efficient search based hash function learning:Hashing based nearest neighbor search method is fundamentally based on hash function. A good hash function aims at obtaining high precision with a short code. Random projection based hashing algorithm does not consider the data distribution and have to generate long code to produce a satisfactory result. And learning based hashing algorithm learns projections from the data distribution and can achieve better performance with shorter code. We study the metrics of a hash function, and propose the Complementary Projection Hashing (CPH) algorithm, which preserves the nearest neighbor property (for obtaining high precision) and considers the hash bucket balance (for obtaining high speed) simultaneously. The proposed algorithm lays the foundation of hashing based high efficient search.(2) Cross-media data based hash function learning:It is important to develop the methodology of cross-media data based hash function learning in the era of large scale multi-media data (such as text, image, video, audio). It requires that the relevance between different kinds of media datum can be expressed within hash coding (i.e. the media datum with same concept must have same code). To this end, we propose the Iterative Multi-View Hashing (IMVH) algorithm considering both within-modal similarity preservation and cross-modal similarity preservation. The similarity preservation not only makes the similar points have similar codes, but also the dissimilar points have dissimilar codes (distinctiveness). The proposed method provides a general framework for hashing based cross-media efficient search.(3) Optimal hash table based fast nearest neighbor search:Currently, the performance of hash table based nearest neighbor search method is still not better than KD-tree based.method. The main reason is that the existing hashing algorithms have to use a large hamming radius for obtaining high precision and high recall, which enormously increases the search time. To address this limitation, we propose the Iterative Expanding Hashing (IEH) algorithm, which uses an auxiliary index to expand the points obtained from a small hamming radius, and finally achieves high precision, high recall and low search time. This method can essentially improve the performance of hash table based method, and provide a strong guarantee for online real-time large scale information retrieval.
Keywords/Search Tags:Big Data, Nearest Neighbor Search, Hashing Algorithm
PDF Full Text Request
Related items