Image captioning is the task of automatically generating fluent text that accurately describes the main content of a given image. It is a cross-modal task that requires both understanding and generation across image and text modalities, and research on it has significant theoretical and practical implications for multimodal retrieval, autonomous driving, and natural language generation. However, most image captioning research focuses on English, with far less attention given to Chinese, leaving considerable room for improvement in Chinese image captioning. This paper aims to address this gap by studying Chinese caption generation.

Most current image captioning algorithms use convolutional neural networks or visual Transformer structures to extract feature vectors from a given image; owing to the structural characteristics of both networks, these encoders suffer from high complexity and computational redundancy. Meanwhile, the decoders of these methods all use a single-round decoding structure for caption generation, an approach that suffers from inadequate decoding and error accumulation. To address these problems, this paper focuses on constructing Chinese-oriented image captioning algorithms. The main work and innovations include:

1. A wave-fusion-based visual multilayer perceptron model for image captioning is proposed. The encoder consists only of stacked multilayer perceptron and activation layers, which greatly simplifies the model structure and reduces the number of training parameters. At the same time, image patches are represented as wave functions, allowing the spatial information of the image to be aggregated dynamically and helping the model extract image features more efficiently. A language decoder based on a memory attention mechanism is also proposed, which effectively simplifies the computation of traditional attention mechanisms; the mechanism additionally tracks the overall characteristics of the dataset during training, balancing the influence of other samples to improve the quality of the generated captions. Experimental results show that the model achieves significant improvements over the baseline model on several compared metrics.

2. An image captioning model based on knowledge guidance and dynamic multi-round decoding is proposed. A dynamic multi-round decoding structure is constructed in which a judging mechanism decides whether to re-decode; this polishes the initially generated text vector, making full use of the image information while correcting previously generated errors. A semantic-enhanced attention mechanism is proposed to incorporate external linguistic knowledge into the decoding vector and guide the model's decoding process, enhancing the language generation capability of the decoder. Finally, experiments show that the quality of the generated text is further improved on several metrics that correlate highly with human judgment.
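The wave-function representation of image patches mentioned in contribution 1 can be illustrated with a toy sketch. This is not the thesis's actual encoder: the function names and the choice of taking the magnitude of the summed wave are illustrative assumptions. The idea shown is only that when each patch carries an amplitude and a phase, patches can reinforce or cancel one another during aggregation, which is what makes the spatial interaction dynamic rather than a fixed sum.

```python
import math
from typing import List

def wave_aggregate(amplitudes: List[float], phases: List[float]) -> float:
    """Toy sketch (not the thesis model): each patch j is treated as a
    wave a_j * e^{i*theta_j}. Patches are fused by summing the complex
    waves; the phase of each patch controls how strongly it reinforces
    or cancels the others."""
    real = sum(a * math.cos(t) for a, t in zip(amplitudes, phases))
    imag = sum(a * math.sin(t) for a, t in zip(amplitudes, phases))
    # Magnitude of the fused wave serves as the aggregated feature.
    return math.hypot(real, imag)

# Two in-phase patches reinforce each other ...
print(wave_aggregate([1.0, 1.0], [0.0, 0.0]))      # → 2.0
# ... while opposite phases cancel almost completely.
print(wave_aggregate([1.0, 1.0], [0.0, math.pi]))  # → ~0.0
```

In the actual model the amplitudes and phases would be learned from the patch features, but the cancellation behavior above is the property that plain amplitude-only pooling lacks.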
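The control flow of the dynamic multi-round decoding in contribution 2 can be sketched as a loop in which a judging mechanism scores the current draft and triggers re-decoding while the score is too low. This is a minimal sketch under stated assumptions: `decode`, `judge`, the threshold, and the round budget are hypothetical placeholders, not the thesis's actual components.

```python
from typing import Callable, List

def multi_round_decode(
    decode: Callable[[List[str]], List[str]],
    judge: Callable[[List[str]], float],
    threshold: float = 0.75,
    max_rounds: int = 3,
) -> List[str]:
    """Sketch of dynamic multi-round decoding (names are assumptions):
    generate an initial caption, then, while the judge's score is below
    the threshold and the round budget is not exhausted, re-decode
    ("polish") using the previous draft as input, so later rounds can
    correct errors made in earlier ones."""
    caption = decode([])          # round 1: initial draft from the image
    rounds = 1
    while judge(caption) < threshold and rounds < max_rounds:
        caption = decode(caption)  # polish the previous draft
        rounds += 1
    return caption

# Toy usage: each call to `decode` produces a longer, better draft,
# and the judge simply rewards longer captions.
drafts = [["a", "dog"], ["a", "dog", "runs"], ["a", "dog", "runs", "fast"]]
calls = {"n": 0}

def toy_decode(prev: List[str]) -> List[str]:
    out = drafts[calls["n"]]
    calls["n"] += 1
    return out

def toy_judge(caption: List[str]) -> float:
    return len(caption) / 4.0

result = multi_round_decode(toy_decode, toy_judge)
print(result)  # → ['a', 'dog', 'runs'] (stops once the score reaches 0.75)
```

The single-round decoders criticized in the abstract correspond to `max_rounds=1`: whatever errors appear in the first draft are final, which is the error-accumulation problem the judging mechanism is meant to mitigate.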