
Research On Key Technologies For Image Deduplication

Posted on: 2015-07-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M Chen    Full Text: PDF
GTID: 1228330467963676    Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth and centralized storage of data, the waste of storage space caused by duplicate data is becoming increasingly serious. This situation has driven the emergence and development of deduplication, an effective way to improve storage utilization by eliminating redundant data. At present it is widely used in backup and archive systems. However, most existing deduplication systems can only eliminate byte-identical images; they cannot handle duplicate images that have the same visual perception but different encodings. As images become an important part of modern data resources, how to detect and eliminate duplicate images based on image content has become an important issue in the modern storage field.

Image deduplication involves two main problems: duplicate image detection and duplicate image elimination. Content-based duplicate image detection can solve the first problem to some extent, but its retrieval accuracy is not high. For the second problem, there is no effective method for selecting a centroid image, so deduplication currently requires manual intervention. Focusing on the characteristics of image deduplication, this dissertation addresses the accuracy of duplicate image detection and the selection of the centroid image, and reaches the following results:

1. To address the low retrieval accuracy of simple duplicate image detection, this dissertation presents a simple duplicate image detection approach based on multiple filtering technologies. The method first uses perceptual hashing to build an index, and then applies multiple filters from several angles, such as spatial structure, color, and texture.
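The two-stage idea in contribution 1 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes images have already been decoded and resized to 8×8 grayscale grids (lists of 64 integers, 0–255), uses a simple average hash for the perceptual-hash index stage, and a coarse intensity histogram as the second filter; the thresholds are illustrative assumptions.

```python
def average_hash(pixels):
    """64-bit perceptual hash: bit is 1 where a pixel is above the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def color_histogram(pixels, bins=4):
    """Coarse normalized intensity histogram used as a cheap second filter."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def is_duplicate(img_a, img_b, hash_thresh=5, hist_thresh=0.1):
    # Stage 1: fast perceptual-hash filter on spatial structure.
    if hamming(average_hash(img_a), average_hash(img_b)) > hash_thresh:
        return False
    # Stage 2: histogram filter rejects pairs that are structure-alike
    # but differ in their intensity/color distribution.
    ha, hb = color_histogram(img_a), color_histogram(img_b)
    return sum(abs(x - y) for x, y in zip(ha, hb)) <= hist_thresh
```

Because each stage only has to reject what the previous stage let through, the filters can individually be cheap while together keeping both recall and precision high, which is the complementarity the abstract refers to.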
Experimental results show that, thanks to the good cohesion and complementarity among the multiple filters, the algorithm not only maintains a high recall rate but also meets the accuracy requirements of image deduplication.

2. To address the low discrimination of image representations in complicated duplicate image detection, this dissertation presents a complicated duplicate image representation approach based on descriptor learning. The approach first formulates the objective function as minimizing the empirical error on the labeled data. The tag matrix and the classification matrix of the training dataset are then brought into the objective function to ensure semantic similarity. Finally, by relaxing the constraints, the learned hash functions are obtained. These learned hashes quantize local descriptors of images into binary codes, and the frequency histograms of the binary codes serve as image representations. Experimental results demonstrate that, compared with state-of-the-art algorithms, this approach effectively improves the discrimination of image representations by introducing semantic information.

3. Retrieval accuracy for similar images is unsatisfactory in complicated duplicate image detection. To solve this problem, this dissertation proposes a complicated duplicate image detection approach based on two-dimensional cloud model calibration. The approach first maps the matching descriptors, refined by Hamming embedding, to points in a two-dimensional space, then uses a cloud model to compute the uncertainty of the two-dimensional point distribution and exclude candidate images with volatile distributions. Finally, images are ranked by voting score. Experimental results show that the new approach not only retains the merit of the weak geometric consistency algorithm, namely its suitability for large-scale image retrieval, but also effectively improves the accuracy of duplicate image detection.
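The calibration step in contribution 3 can be sketched with the standard backward normal cloud generator: for each axis of the 2D matched-point set it estimates expectation (Ex), entropy (En), and hyper-entropy (He), and the combined He measures how volatile the distribution is. This is a hedged illustration of the idea, not the dissertation's algorithm; the threshold and the way the two axes are combined are assumptions.

```python
import math

def backward_cloud_1d(xs):
    """Backward normal cloud generator: returns (Ex, En, He) for one axis."""
    n = len(xs)
    ex = sum(xs) / n
    # En from the mean absolute deviation (standard backward-cloud estimate).
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in xs) / n
    # He from the gap between sample variance and En^2.
    s2 = sum((x - ex) ** 2 for x in xs) / (n - 1)
    he = math.sqrt(max(s2 - en ** 2, 0.0))
    return ex, en, he

def uncertainty_2d(points):
    """Combined hyper-entropy of a 2D point set (larger = more volatile)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return backward_cloud_1d(xs)[2] + backward_cloud_1d(ys)[2]

def filter_candidates(candidates, he_thresh=1.0):
    """Keep candidates whose matched-point distribution is stable.

    candidates: list of (image_id, matched_points) pairs.
    """
    return [img for img, pts in candidates if uncertainty_2d(pts) <= he_thresh]
```

A candidate whose matched descriptors scatter erratically (heavy-tailed, outlier-driven) yields a high He and is excluded before voting, while a geometrically consistent match survives.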
4. To select the centroid image automatically within a duplicate image set, this dissertation proposes a centroid selection method based on fuzzy logic reasoning. The approach first designs rules according to the characteristics of human visual perception and the purpose of image deduplication. Image attribute information is then used to reason out a comprehensive quantitative value by simulating human thought patterns, and this value is exploited to select the centroid image. Experimental results demonstrate that the approach accurately finds the centroid image.

5. Building on these studies, and to achieve large-scale image deduplication, this dissertation designs a complete Hadoop-based deduplication framework covering both simple and complicated duplicate images. The framework is divided into two phases: online deduplication and offline deduplication. In the online stage, the framework exploits HBase to achieve preliminary, rapid deduplication. In the offline stage, it exploits MapReduce to further detect duplicate images, and the centroid selection results are recommended to users. Online deduplication filters out most simple duplicate images, reducing the workload of offline deduplication, while offline deduplication improves the overall deduplication rate of the system by further detecting duplicate images.
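The reasoning step in contribution 4 can be sketched as a tiny Mamdani-style rule base. Everything here is an assumption for illustration: the attributes (normalized resolution and sharpness), the triangular membership functions, the two rules, and the defuzzification constants are not the dissertation's actual rule base, only an example of fuzzy reasoning producing a comprehensive quantitative value.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def quality_score(resolution, sharpness):
    """Fuzzy rules collapsed to one defuzzified quality value in (0, 1]."""
    res_high = tri(resolution, 0.3, 1.0, 1.7)   # "resolution is high"
    sharp_high = tri(sharpness, 0.3, 1.0, 1.7)  # "sharpness is high"
    # Rule 1: IF resolution high AND sharpness high THEN quality high (min AND).
    q_high = min(res_high, sharp_high)
    # Rule 2: IF resolution low OR sharpness low THEN quality low (max OR).
    q_low = max(1 - res_high, 1 - sharp_high)
    # Centroid defuzzification over two output singletons (high=1.0, low=0.2).
    return (q_high * 1.0 + q_low * 0.2) / (q_high + q_low + 1e-9)

def select_centroid(images):
    """images: list of (name, resolution, sharpness); returns the best name."""
    return max(images, key=lambda im: quality_score(im[1], im[2]))[0]
```

The comprehensive value lets every image in a duplicate set be ranked on one scale, so the centroid falls out automatically instead of requiring manual inspection.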
Keywords/Search Tags: perceptual hashing, centroid image, fuzzy logic reasoning, Hamming embedding, weak geometric consistency constraints, semi-supervised learning