
Research And Implementation Of Key Technologies Of Image Caption Based On Deep Learning

Posted on: 2021-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Ji
Full Text: PDF
GTID: 2518306050455214
Subject: Master of Engineering

Abstract/Summary:
The purpose of Image Caption is to automatically generate text describing a given image. In the past few years it has become a topic of growing interest. Image Caption technology can enhance the image search capabilities of search engines, help visually impaired people better understand their surroundings, and interpret images shared on social media platforms to identify customer interests, improving online marketing and customer segmentation. In short, Image Caption plays an important role in many areas. It is an emerging task that spans computer vision and natural language processing, and it also contributes to the study of multi-modal interaction. However, to generate a fluent natural-language description the way a person would, a captioning system must not only recognize the objects in an image but also understand the relationships between them, including each object's motion and form, and translate this information into text. This makes Image Caption a complex and challenging task.

Traditional Image Caption methods are mainly based on generation and retrieval. Their limitation is that they depend too heavily on early-stage image feature processing and neglect the text generation process of the language model, so their results are unsatisfactory. Recently, encoder-decoder frameworks based on deep neural networks have been widely applied to Image Caption and have made great progress.

This thesis studies how the encoder-decoder model can be used effectively for the Image Caption task, covering the ability of deep convolutional neural networks to extract image features, the handling of word embedding models, and the effect of the attention model on the overall system. The main work on deep-learning-based Image Caption is as follows:

1) Image Caption requires extracting features from the image. If the selected features are not representative enough, it is difficult to distinguish the objects in an image and their relationships. A common traditional approach is to apply several feature extractors and combine their outputs, but this requires many heuristic rules and parameter adjustments for each domain. To solve this problem, this thesis proposes a learning method for deep convolutional neural networks based on transfer learning.

2) Word embedding can be roughly divided into word representation and sequence identification. In natural language processing, word embedding techniques represented by Word2Vec and GloVe have been used to represent a single word as a vector that a computer can process. However, such embeddings assign a polysemous word a single representation, which is inaccurate. With the rise of deep learning, more word embedding models have been proposed, such as ELMo, OpenAI GPT, and BERT. These models use a deep network in a pre-training stage to capture semantic information from natural language through unsupervised learning, and then transfer the model to downstream tasks in a fine-tuning stage. To address the problem of polysemous words, this thesis proposes an Image Caption method based on the BERT word embedding model.

3) Encoder-decoder models are the main approach to Image Caption. Their structure is simple, but they often have difficulty generating high-quality sentences, and feeding global image information at every training step also slows training down. This thesis proposes a bi-directional Image Caption model based on an attention mechanism: a text-guided image feature extraction module is introduced in the encoder stage, and an image-guided text generation module is introduced in the decoder stage, which greatly improves the accuracy of the automatically generated descriptions.
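The transfer-learning idea in contribution 1) can be sketched as: keep a pretrained convolutional backbone frozen as a generic feature extractor and update only a small task-specific head on the captioning data. The sketch below is a minimal NumPy illustration of that split, not the thesis's actual model; the "pretrained" weights and the image vector are random stand-ins for a real CNN such as a ResNet trained on ImageNet:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed linear map + ReLU.
W_pretrained = rng.standard_normal((2048, 512))

def extract_features(image_vec):
    """Frozen backbone: these weights are never updated during fine-tuning."""
    return np.maximum(W_pretrained.T @ image_vec, 0.0)

# Trainable task head: the only parameters updated on the new task.
W_head = rng.standard_normal((512, 10)) * 0.01

def head_forward(feat):
    logits = W_head.T @ feat
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over 10 toy classes

# One SGD step on the head only; the backbone stays frozen.
x = rng.standard_normal(2048)   # fake image vector
y = 3                           # fake label
feat = extract_features(x)
p = head_forward(feat)
grad_logits = p.copy()
grad_logits[y] -= 1.0           # gradient of cross-entropy w.r.t. logits
W_head -= 0.1 * np.outer(feat, grad_logits)
```

The point of the split is that the expensive, data-hungry part (the backbone) is reused, while only the small head needs task-specific training.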
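The polysemy problem behind contribution 2) can be illustrated by contrasting static and contextual embeddings. The toy code below is purely illustrative (random vectors and a single self-attention-style mixing step), not BERT itself: a static table gives "bank" one vector everywhere, while a context-dependent mixture produces different vectors in different sentences.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "bank", "river", "money", "of"]
dim = 8

# Static embedding: one fixed vector per word, regardless of context.
static = {w: rng.standard_normal(dim) for w in vocab}

def contextual(sentence):
    """Toy contextual embedding: each word's vector becomes a
    similarity-weighted mixture of all word vectors in the sentence
    (one self-attention-like step; real models stack many layers)."""
    X = np.stack([static[w] for w in sentence])
    scores = X @ X.T / np.sqrt(dim)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

s1 = ["bank", "of", "river"]
s2 = ["bank", "of", "money"]
bank_1 = contextual(s1)[0]   # "bank" near "river"
bank_2 = contextual(s2)[0]   # "bank" near "money"

# The static table cannot separate the two senses of "bank"...
same_static = np.allclose(static["bank"], static["bank"])
# ...but the contextual vectors differ with the surrounding words.
differ = not np.allclose(bank_1, bank_2)
```

This is the property that makes BERT-style embeddings attractive for captioning: the representation of a word adapts to its sentence.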
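The attention mechanism in contribution 3) lets the decoder weight image regions differently at each word step instead of consuming one global image vector. A minimal soft-attention step is sketched below (in the spirit of additive attention over CNN region features; all weights and inputs are random stand-ins, not the thesis's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
num_regions, feat_dim, hid_dim, att_dim = 6, 16, 12, 12

V = rng.standard_normal((num_regions, feat_dim))  # region features from the CNN encoder
h = rng.standard_normal(hid_dim)                  # decoder hidden state (e.g. from an LSTM)

# Additive-attention parameters (random stand-ins).
W_v = rng.standard_normal((feat_dim, att_dim))
W_h = rng.standard_normal((hid_dim, att_dim))
w = rng.standard_normal(att_dim)

def soft_attention(V, h):
    """Score each region against the current hidden state, then return
    a weighted average of region features as the context vector."""
    scores = np.tanh(V @ W_v + h @ W_h) @ w   # one score per region
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                       # attention weights, sum to 1
    context = alpha @ V                       # weighted mix of region features
    return alpha, context

alpha, context = soft_attention(V, h)
```

At each decoding step the context vector changes with the hidden state, so the model can "look at" the regions relevant to the next word.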
Keywords/Search Tags: Image Caption, deep learning, convolutional neural network, long short-term memory network, attention mechanism, transfer learning, feature learning