Research On Near-duplicate Document Image Retrieval Based On Deep Learning

Posted on: 2022-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: B X Xu
Full Text: PDF
GTID: 2518306539992059
Subject: Computer Science and Technology
Abstract/Summary:
Near-duplicate document image retrieval plays an important role in document image analysis and understanding, and it has applications in many fields. Traditional near-duplicate document image retrieval methods require the types of variation among near-duplicate document images to be identified beforehand, so that suitable features can be selected by hand for image description. Because these methods involve complicated steps and are inefficient, this thesis studies near-duplicate document image retrieval based on deep learning. The research covers two main aspects.

First, this thesis proposes a classification-based convolutional neural network (CNN) for near-duplicate document image retrieval. The CNN serves as a feature extractor, and the extracted features are used for retrieval. Experiments compare how well features taken from different layers of different CNNs, and processed with different pooling methods, describe the images. Two kinds of datasets are used to fine-tune the network, and the influence of the fine-tuning dataset on retrieval performance is compared.

Second, this thesis proposes a near-duplicate document image retrieval approach based on a three-stream convolutional siamese network, which learns the types of variation among near-duplicate document images automatically. The input to the network is a triplet composed of a query image, a near-duplicate of the query, and a non-near-duplicate image. Training with the triplet loss pushes the distance between the query and its near-duplicate image to be smaller than the distance between the query and its non-near-duplicate image. The trained network can then be used to generate features for arbitrary images, and these features are robust to the variations among near-duplicate document images. Since no public datasets exist for near-duplicate document image retrieval, two datasets are built, consisting of near-duplicate document images in Chinese and in English, respectively. The near-duplicate document images differ greatly in illumination, viewpoint, and resolution. Extensive experiments on the two newly created datasets demonstrate the effectiveness of the proposed approach.
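The triplet constraint described in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss formulation, so the hinge form, the Euclidean distance, the `margin` value, and the function names below are assumptions made for illustration only. The short vectors stand in for features that a trained network would produce.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(query, near_dup, non_dup, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the thesis).

    The loss is zero once the query is closer to its near-duplicate than
    to the non-near-duplicate by at least `margin`.
    """
    d_pos = euclidean(query, near_dup)
    d_neg = euclidean(query, non_dup)
    return max(0.0, d_pos - d_neg + margin)


def rank_by_distance(query, gallery):
    """Return gallery indices sorted from nearest to farthest from the query."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))


# Toy 2-D features standing in for network outputs (hypothetical values).
query = [0.0, 0.0]
near_dup = [0.1, 0.0]
non_dup = [3.0, 4.0]

loss_satisfied = triplet_loss(query, near_dup, non_dup)  # constraint already met
loss_violated = triplet_loss(query, non_dup, near_dup)   # roles swapped on purpose
ranking = rank_by_distance(query, [non_dup, near_dup])
```

In this toy case the loss is zero when the near-duplicate is already much closer to the query, and positive when the roles are swapped. Retrieval then reduces to ranking the gallery by feature distance to the query.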
Keywords/Search Tags: near-duplicate document image retrieval, feature extractor, three-stream convolutional siamese network, triplet loss