Research On Near-duplicate Document Image Retrieval Based On Deep Learning

Posted on: 2022-09-25

Degree: Master

Type: Thesis

Country: China

Candidate: B X Xu

GTID: 2518306539992059

Subject: Computer Science and Technology

Abstract/Summary:

Near-duplicate document image retrieval plays an important role in document image analysis and understanding, and it has applications in many fields. Traditional near-duplicate document image retrieval methods require the types of variation among near-duplicate document images to be identified beforehand, so that suitable features can be manually selected for image description. Because these traditional methods involve complicated steps and are inefficient, this thesis studies near-duplicate document image retrieval based on deep learning. The research covers two main aspects. First, this thesis proposes a classification-based convolutional neural network (CNN) for near-duplicate document image retrieval. The CNN serves as a feature extractor, and the extracted features are used for retrieval. Experiments compare how well features taken from different layers of different CNNs, and processed by different pooling methods, describe images. Two kinds of datasets are used to fine-tune the network, and the influence of the fine-tuning dataset on retrieval performance is compared. Second, this thesis proposes a near-duplicate document image retrieval approach based on a three-stream convolutional siamese network, which can learn the types of variation among near-duplicate document images automatically. The input to the proposed network is a triplet composed of a query image, a near-duplicate of it, and a non-near-duplicate image. Training with the triplet loss drives the distance between the query and its near-duplicate image to be smaller than the distance between the query and its non-near-duplicate image. The trained network can then be used to generate features for arbitrary images, and these features are robust to the variations among near-duplicate document images. Since there are no public datasets for near-duplicate document image retrieval, two datasets are built, consisting of near-duplicate document images in Chinese and English, respectively. The near-duplicate document images differ greatly in illumination, viewpoint, and resolution. Extensive experiments on the two newly created datasets demonstrate the effectiveness of the proposed approach.
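
The triplet objective and the feature-based retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance metric, the margin, or the exact form of the loss, so the code below assumes the common hinge form with Euclidean distance; the function names, the `margin` parameter, and the toy feature vectors are hypothetical and are not taken from the thesis.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value; the abstract gives no margin
) -> float:
    """Hinge-style triplet loss (an assumed form, not confirmed by the source).

    The loss is zero once the near-duplicate (positive) is closer to the
    query than the non-near-duplicate (negative) by at least `margin`.
    """
    d_pos = euclidean_distance(query, positive)
    d_neg = euclidean_distance(query, negative)
    return max(0.0, d_pos - d_neg + margin)


def rank_by_distance(
    query: Sequence[float], gallery: Sequence[Sequence[float]]
) -> List[int]:
    """Return gallery indices sorted from nearest to farthest from the query."""
    return sorted(
        range(len(gallery)),
        key=lambda i: euclidean_distance(query, gallery[i]),
    )


if __name__ == "__main__":
    # Toy 2-D "features"; real features would come from the trained network.
    q, pos, neg = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
    print(triplet_loss(q, pos, neg))           # constraint satisfied
    print(triplet_loss(q, neg, pos))           # constraint violated
    print(rank_by_distance(q, [neg, pos, [1.0, 1.0]]))
```

In this sketch a loss of zero means the triplet already satisfies the distance constraint, while a positive value is what training would push down. At retrieval time only the learned features and a distance ranking such as `rank_by_distance` are needed.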

Keywords/Search Tags:

near-duplicate document image retrieval, feature extractor, three-stream convolutional siamese network, triplet loss

Related items

1	Research And Implementation Of Image And 3D Shape Retrieval Algorithms Based On Deep Learning
2	Research On Partial-duplicate Image Retrieval Algorithms Based On The Multi-contextual Clues
3	Instance-level Image Retrieval Based On Convolutional Neural Network
4	Research On Document Image Retrieval Technology Based On Combined Feature
5	The Research And Application Of 3D Shape Retrieval Based On Sketches
6	Streamlined Feature Representation For Content-based Image Retrieval
7	Near Duplicate Video Detection Based On Short Video
8	Research On Face Recognition Based On Machine Learning Method
9	Research And Application Of Feature-based Document Image Retrieval
10	Document Image Classification And Retrieval Based On Convolutional Neural Networks