
Research On Image Caption Algorithm Based On Transformer Architecture

Posted on: 2024-03-09    Degree: Master    Type: Thesis
Country: China    Candidate: X Z Wang    Full Text: PDF
GTID: 2568307094958749    Subject: Electronic information
Abstract/Summary:
Image captioning is an important research direction at the intersection of computer vision and natural language processing. Its goal is to enable computers to automatically generate a natural language description of an image. The task matters for improving machine understanding of images and for bridging the gap between vision and language. The core difficulty lies in extracting meaningful features from images and converting those features into natural language. Current research in this field focuses mainly on improving Transformer encoder-decoder models and pre-trained language models. In a typical pipeline, the image is passed through a feature extractor and then the encoder to produce image feature representations, which the decoder turns into a sentence describing the image content. However, existing methods fall short in feature utilization, description of background information, and sentence controllability. This thesis therefore conducts in-depth research on image captioning methods, building on existing techniques. The main contributions are as follows:

First, to address insufficient feature utilization and the tendency to overlook image details, this thesis improves image captioning with a feature reconstruction module and channel self-attention. An Adaptive Feature Reconstruction module based on the Transformer (AFRT) is designed to extract visual features using convolution kernels of different scales. An adaptive channel attention module then filters the important visual features and reassigns word weights, and the model is fine-tuned with reinforcement learning. Experimental results show that the Transformer-based AFRT model generates more detailed and accurate image descriptions.

Second, this thesis proposes a novel encoder-decoder enhancement model for image captioning (TEDL) that incorporates diffused text to solve the
problem that BERT-style pre-trained models are difficult to fine-tune and that some generated sentences are disordered. A Vision Transformer backbone is used to extract fine-grained grid features, and the original BERT decoder structure is replaced with a DeBERTa network to further improve the decoding quality of the Transformer model. Text diffusion is then used to supervise the sentence-generation process and reduce disordered sentence descriptions, and the model is again fine-tuned with reinforcement learning. Experimental results show that fine-tuning with TEDL is more stable and that the semantic representation of images is more accurate and orderly.
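The abstract gives no implementation details for the AFRT encoder; as a rough, hypothetical illustration (shapes, kernel choices, and function names are assumptions, not from the thesis), its two core ideas — multi-scale convolutional feature extraction and channel-wise attention reweighting — can be sketched as:

```python
import numpy as np

def multi_scale_features(x, kernel_sizes=(1, 3, 5)):
    """Extract features with convolution kernels of different scales and
    concatenate them along the channel axis.
    x: (channels, length) feature map (hypothetical layout)."""
    outs = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # placeholder averaging kernel per scale
        outs.append(np.stack([np.convolve(c, kernel, mode="same") for c in x]))
    return np.concatenate(outs, axis=0)  # (channels * num_scales, length)

def channel_attention(feats):
    """Reweight channels by a softmax over their global average activations,
    emphasizing the more informative channels."""
    pooled = feats.mean(axis=1)               # global average pool, (C,)
    weights = np.exp(pooled - pooled.max())
    weights /= weights.sum()                  # softmax over channels
    return feats * weights[:, None]

x = np.random.rand(4, 16)           # 4 channels, 16 spatial positions
f = multi_scale_features(x)         # (12, 16): 4 channels x 3 scales
attended = channel_attention(f)
print(attended.shape)               # (12, 16)
```

In the thesis the reweighted features would feed the Transformer decoder; here the module is shown in isolation.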
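Both methods are fine-tuned with reinforcement learning, but the abstract does not name the algorithm. A common choice in image captioning is self-critical sequence training (SCST), in which the reward of a sampled caption is baselined against the greedily decoded caption; a minimal sketch of the resulting policy-gradient loss (the reward values and log-probabilities below are placeholders, not from the thesis):

```python
import numpy as np

def scst_loss(logprob_sampled, reward_sampled, reward_greedy):
    """Self-critical policy-gradient loss: increase the log-probability of
    sampled captions that beat the greedy baseline, decrease it otherwise.
    logprob_sampled: per-caption sum of token log-probabilities, shape (B,)
    reward_sampled / reward_greedy: e.g. CIDEr scores, shape (B,)."""
    advantage = reward_sampled - reward_greedy   # baseline subtraction
    return float(np.mean(-advantage * logprob_sampled))

# Toy batch: one sampled caption beats the greedy baseline, one falls short.
logp = np.array([-5.0, -7.0])
r_sample = np.array([0.9, 0.4])
r_greedy = np.array([0.6, 0.6])
print(round(scst_loss(logp, r_sample, r_greedy), 2))  # 0.05
```

The greedy baseline keeps the gradient estimate low-variance without a learned value function, which is why SCST is widely used for CIDEr-oriented fine-tuning.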
Keywords/Search Tags: Image Captioning, Convolutional Neural Network, Transformer, Attentional Mechanism, Diffusion Model