Research On Image Captioning Algorithm Based On Deep Neural Networks

Posted on: 2023-06-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S S Zhao
Full Text: PDF
GTID: 1528306914978049
Subject: Cyberspace security
Abstract/Summary:
Rapid progress in machine learning and deep learning has greatly improved performance on both computer vision and machine translation tasks. Image captioning is a cross-modal task spanning computer vision and natural language processing: its goal is to automatically translate the visual content of an image into natural language. Image captioning models have a wide range of applications, such as driverless cars, human-machine interaction, navigation for the visually impaired, and image content security detection.

General image captioning models fall mainly into retrieval-based and generation-based models, and each kind has advantages and disadvantages. In a generation-based model, the image can be represented as low-level visual features, a structured semantic scene graph, or high-level concepts, and a decoder translates these features into natural language. However, a generation-based model can hardly understand the image deeply when it uses only visual features or only semantic features. The descriptions returned by retrieval-based models, in turn, are generally not tailored to the details of the image.

To address these problems, this dissertation proposes a unified retrieval- and generation-based method for image captioning, a feature-enhanced image captioning method that incorporates retrieval, and a visual-semantic scene graph alignment method for image captioning. The main research results and innovations are as follows:

1. We propose a unified retrieval- and generation-based method for image caption generation. First, a retrieval-based image captioning method retrieves, for each image in the dataset, similar images and their annotated descriptions. Second, a denoising module filters out the semantic information in the annotated descriptions that is unrelated to the visual features. Finally, the decoder translates the visual relationship features and the denoised semantic features into natural language. The visual relationship features and the denoised semantic features can also serve as a set of pre-processed features for other encoder-decoder image captioning methods. The proposed model was evaluated with the common evaluation metrics on the MSCOCO test set. The results show that it performs best among the state-of-the-art models it was compared with, and the ablation studies verify the effectiveness of the denoising module.

2. We propose a feature-enhanced image captioning method that incorporates a retrieval-based method. First, the retrieval-based method retrieves from the dataset the images that are similar to the query image, together with their annotated sentences. Second, a Cross-modal Feature Distilling module enables mutual cross-modal interaction between the encoded query images and the similar sentences, distilling coarsely aligned region-word features. Third, a Gated Feature Fusion module densely fuses the coarsely aligned features and reduces the fusion of mismatched features according to a gate score. Fourth, the aggregated, deeply interacted features are concatenated to form the enhanced features. Finally, the decoder uses the enhanced features and the visual relationship features to understand the image and translate it into a description. Compared with other state-of-the-art models, the proposed model achieves competitive test results, and the ablation studies demonstrate the effectiveness of its individual components.

3. We propose a visual-semantic scene graph alignment image captioning model. First, the image scene graphs and the sentence scene graphs are encoded. Second, a multi-scale cross-modal alignment module aligns the image scene graphs and the sentence scene graphs at different levels. The module can filter redundant information from the image scene graphs according to the sentence scene graphs and provide commonsense information to the decoder. Third, the bounding boxes are used to compute implied spatial relations. Finally, the decoder fuses the aligned scene graph features and the implied spatial relationships through a dynamic fusion attention mechanism and translates them into descriptions. The proposed model achieves higher results on the commonly used evaluation metrics than the other state-of-the-art models it was compared with. The ablation studies verify the influence of the different modules, and the proposed model performs best when all components are combined.
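The retrieval step that opens the first two contributions can be illustrated with a minimal sketch. The abstract does not say how image similarity is measured, so cosine similarity over precomputed feature vectors is an assumption here, and the names `cosine`, `retrieve_captions`, and the `gallery` layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(query_feat, gallery, k=2):
    """Collect the captions of the k gallery images most similar to the query.

    gallery is a list of (feature_vector, [captions]) pairs; this layout is
    an assumption made for illustration, not the dissertation's data format.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query_feat, item[0]), reverse=True)
    captions = []
    for _, caps in ranked[:k]:
        captions.extend(caps)
    return captions


# Toy gallery with three-dimensional features (illustrative values only).
gallery = [
    ([1.0, 0.0, 0.0], ["a dog runs on grass"]),
    ([0.9, 0.1, 0.0], ["a puppy plays in a park"]),
    ([0.0, 0.0, 1.0], ["a plane in the sky"]),
]
print(retrieve_captions([1.0, 0.05, 0.0], gallery, k=2))
```

In a setup like this, the retrieved captions would then be passed to a denoising or distilling stage, since nothing guarantees that every retrieved word matches the query image.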
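The idea behind the Gated Feature Fusion module in the second contribution, fusing aligned region-word features while suppressing mismatched ones, can be sketched as follows. The abstract does not give the gate's formula, so the sigmoid of a scaled dot product, the residual-style sum, and the `scale` and `bias` parameters below are all assumptions for illustration; a trained model would learn its gate parameters.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(region, word, scale=5.0, bias=-2.5):
    """Add a word feature to a region feature, weighted by a gate score.

    The gate is high when the two features agree (large dot product) and
    low when they do not, so mismatched words contribute little.
    scale and bias are hypothetical fixed values standing in for learned ones.
    """
    score = sum(r * w for r, w in zip(region, word))
    gate = sigmoid(scale * score + bias)
    fused = [r + gate * w for r, w in zip(region, word)]
    return gate, fused


region = [1.0, 0.0]
gate_match, fused_match = gated_fuse(region, [1.0, 0.0])        # aligned word
gate_mismatch, fused_mismatch = gated_fuse(region, [0.0, 1.0])  # unrelated word
print(round(gate_match, 3), round(gate_mismatch, 3))
```

With these toy vectors the aligned word receives a gate close to 1 and the unrelated word a gate close to 0, which is the qualitative behavior the abstract describes.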
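The third contribution derives implied spatial relations from bounding boxes. The abstract does not list the relation classes or the rules used, so the sketch below assumes a small hypothetical label set ("inside", "overlaps", "left of", "right of", "above", "below") based on containment, intersection over union, and the offset between box centers.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_relation(a, b, overlap_threshold=0.5):
    """Coarse relation of box a relative to box b (image y axis points down).

    The label set and the threshold are assumptions made for illustration.
    """
    if a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"
    if iou(a, b) > overlap_threshold:
        return "overlaps"
    # Otherwise decide by the dominant axis of the center-to-center offset.
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


print(spatial_relation((0, 0, 10, 10), (20, 0, 30, 10)))
print(spatial_relation((2, 2, 4, 4), (0, 0, 10, 10)))
```

Relations computed this way need no extra annotation, which is presumably why they are described as implied; the resulting labels could be attached as edges between the corresponding scene graph nodes.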
Keywords/Search Tags: visual features, semantic features, image captioning, cross-modal task