
Research On Key Technologies For Image Deduplication

Posted on: 2015-07-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M Chen    Full Text: PDF
GTID: 1228330467963676    Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth and centralized storage of data, the waste of storage space caused by duplicate data is becoming increasingly serious. This situation has driven the emergence and development of deduplication, an effective way to improve storage utilization by eliminating redundant data. At present it is widely used in backup and archive systems. However, most existing deduplication systems can only eliminate byte-identical images; they cannot handle duplicate images that have the same visual perception but different encodings. As images become an important part of modern data resources, how to detect and eliminate duplicate images based on image content has become an important issue in the modern storage field.

Image deduplication involves two main problems: duplicate image detection and duplicate image elimination. Content-based duplicate image detection can solve the first problem to some extent, but its retrieval accuracy is not high. For the second problem, there is no effective method for selecting a centroid image, so deduplication currently requires manual intervention. Focusing on the characteristics of image deduplication, this dissertation addresses the accuracy of duplicate image detection and the selection of the centroid image, and reaches the following results:

1. To address the low retrieval accuracy of simple duplicate image detection, this dissertation presents a simple duplicate image detection approach based on multiple filtering technologies. The method first uses perceptual hashing to build an index, and then applies multiple filters from several angles, such as spatial structure, color, and texture.
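The two-stage idea in contribution 1 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes images have already been decoded and resized to 8×8 grayscale grids (lists of 64 integers, 0–255), uses a simple average hash for the perceptual-hash index stage, and a coarse intensity histogram as the second filter; the thresholds are illustrative assumptions.

```python
def average_hash(pixels):
    """64-bit perceptual hash: bit is 1 where a pixel is above the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def color_histogram(pixels, bins=4):
    """Coarse normalized intensity histogram used as a cheap second filter."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def is_duplicate(img_a, img_b, hash_thresh=5, hist_thresh=0.1):
    # Stage 1: fast perceptual-hash filter on spatial structure.
    if hamming(average_hash(img_a), average_hash(img_b)) > hash_thresh:
        return False
    # Stage 2: histogram filter rejects pairs that are structure-alike
    # but differ in their intensity/color distribution.
    ha, hb = color_histogram(img_a), color_histogram(img_b)
    return sum(abs(x - y) for x, y in zip(ha, hb)) <= hist_thresh
```

Because each stage only has to reject what the previous stage let through, the filters can individually be cheap while together keeping both recall and precision high, which is the complementarity the abstract refers to.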
Experimental results show that, thanks to the good cohesion and complementarity among the multiple filters, the algorithm not only maintains a high recall rate but also meets the accuracy requirements of image deduplication.

2. To address the low discrimination of image representations in complicated duplicate image detection, this dissertation presents a complicated duplicate image representation approach based on descriptor learning. The approach first formulates the objective function as minimizing the empirical error on the labeled data. The tag matrix and the classification matrix of the training dataset are then brought into the objective function to ensure semantic similarity. Finally, by relaxing the constraints, the learned hash functions are obtained. These learned hashes quantize local descriptors of images into binary codes, and the frequency histograms of the binary codes serve as image representations. Experimental results demonstrate that, compared with state-of-the-art algorithms, this approach effectively improves the discrimination of image representations by introducing semantic information.

3. Retrieval accuracy for similar images is unsatisfactory in complicated duplicate image detection. To solve this problem, this dissertation proposes a complicated duplicate image detection approach based on two-dimensional cloud model calibration. The approach first maps the matching descriptors, refined by Hamming embedding, to points in a two-dimensional space, then uses a cloud model to compute the uncertainty of the two-dimensional point distribution and exclude candidate images with volatile distributions. Finally, images are ranked by voting score. Experimental results show that the new approach not only retains the merit of the weak geometric consistency algorithm, namely its suitability for large-scale image retrieval, but also effectively improves the accuracy of duplicate image detection.
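The calibration step in contribution 3 can be sketched with the standard backward normal cloud generator: for each axis of the 2D matched-point set it estimates expectation (Ex), entropy (En), and hyper-entropy (He), and the combined He measures how volatile the distribution is. This is a hedged illustration of the idea, not the dissertation's algorithm; the threshold and the way the two axes are combined are assumptions.

```python
import math

def backward_cloud_1d(xs):
    """Backward normal cloud generator: returns (Ex, En, He) for one axis."""
    n = len(xs)
    ex = sum(xs) / n
    # En from the mean absolute deviation (standard backward-cloud estimate).
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in xs) / n
    # He from the gap between sample variance and En^2.
    s2 = sum((x - ex) ** 2 for x in xs) / (n - 1)
    he = math.sqrt(max(s2 - en ** 2, 0.0))
    return ex, en, he

def uncertainty_2d(points):
    """Combined hyper-entropy of a 2D point set (larger = more volatile)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return backward_cloud_1d(xs)[2] + backward_cloud_1d(ys)[2]

def filter_candidates(candidates, he_thresh=1.0):
    """Keep candidates whose matched-point distribution is stable.

    candidates: list of (image_id, matched_points) pairs.
    """
    return [img for img, pts in candidates if uncertainty_2d(pts) <= he_thresh]
```

A candidate whose matched descriptors scatter erratically (heavy-tailed, outlier-driven) yields a high He and is excluded before voting, while a geometrically consistent match survives.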
4. To select the centroid image automatically within a duplicate image set, this dissertation proposes a centroid selection method based on fuzzy logic reasoning. The approach first designs rules according to the characteristics of human visual perception and the purpose of image deduplication. Image attribute information is then used to reason out a comprehensive quantitative value by simulating human thought patterns, and this value is exploited to select the centroid image. Experimental results demonstrate that the approach accurately finds the centroid image.

5. Building on these studies, and to achieve large-scale image deduplication, this dissertation designs a complete Hadoop-based deduplication framework covering both simple and complicated duplicate images. The framework is divided into two phases: online deduplication and offline deduplication. In the online stage, the framework exploits HBase to achieve preliminary, rapid deduplication. In the offline stage, it exploits MapReduce to further detect duplicate images, and the centroid selection results are recommended to users. Online deduplication filters out most simple duplicate images, reducing the workload of offline deduplication, while offline deduplication improves the overall deduplication rate of the system by further detecting duplicate images.
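The reasoning step in contribution 4 can be sketched as a tiny Mamdani-style rule base. Everything here is an assumption for illustration: the attributes (normalized resolution and sharpness), the triangular membership functions, the two rules, and the defuzzification constants are not the dissertation's actual rule base, only an example of fuzzy reasoning producing a comprehensive quantitative value.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def quality_score(resolution, sharpness):
    """Fuzzy rules collapsed to one defuzzified quality value in (0, 1]."""
    res_high = tri(resolution, 0.3, 1.0, 1.7)   # "resolution is high"
    sharp_high = tri(sharpness, 0.3, 1.0, 1.7)  # "sharpness is high"
    # Rule 1: IF resolution high AND sharpness high THEN quality high (min AND).
    q_high = min(res_high, sharp_high)
    # Rule 2: IF resolution low OR sharpness low THEN quality low (max OR).
    q_low = max(1 - res_high, 1 - sharp_high)
    # Centroid defuzzification over two output singletons (high=1.0, low=0.2).
    return (q_high * 1.0 + q_low * 0.2) / (q_high + q_low + 1e-9)

def select_centroid(images):
    """images: list of (name, resolution, sharpness); returns the best name."""
    return max(images, key=lambda im: quality_score(im[1], im[2]))[0]
```

The comprehensive value lets every image in a duplicate set be ranked on one scale, so the centroid falls out automatically instead of requiring manual inspection.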
Keywords/Search Tags: perceptual hashing, centroid image, fuzzy logic reasoning, Hamming embedding, weak geometric consistency constraints, semi-supervised learning