The image captioning task, related to visual discrimination exercises, aims to generate descriptive language from an image. Image captioning can serve as an effective aid for the visually impaired, and it can also be applied to social media, medicine, digital libraries, and other fields, bringing convenience to more people. Compared with computer vision tasks such as classification, detection, and segmentation, the difficulty of image captioning lies not only in identifying the attributes of objects in an image and the interactions between them, but also in learning human grammatical knowledge so as to generate grammatically and semantically fluent sentences. With the rapid development of deep learning in recent years, the encoder-decoder framework based on the attention mechanism has become the standard framework for image captioning and has been shown to generate accurate descriptions from image content. However, as follow-up research gradually saturates, the accuracy of model-generated captions has become difficult to improve, and scores on the evaluation metrics are likewise hard to increase. Although models that use reinforcement learning to optimize the evaluation metrics directly hold certain advantages on those metrics over models trained with cross-entropy, the fluency of the generated captions is greatly reduced. Therefore, this thesis does not adopt the idea of reinforcement learning, but instead improves the LSTM structure and applies multiple attention mechanisms adapted to the image captioning task.

This thesis proposes an improved image captioning model based on multi-layer attention and multi-representational attention. In the encoding process, two different encoders are used to introduce different aspects of the image, improving on the original plain convolutional encoder: a ResNet with CBAM extracts two-dimensional information from the image, namely spatial and channel features, while a Faster R-CNN extracts objects' category and outline information. In the decoding process, the word embedding step is optimized first: pre-trained BERT representations are introduced to give the model prior knowledge of the contextual and grammatical relationships between words, producing more accurate word vectors with semantic information. In addition, the decoder is optimized with multi-layer attention, multi-representational attention, and a double-layer LSTM structure, and visual analysis shows that the introduced attention mechanisms enable the model to generate descriptions that correctly reflect the image content. The model achieves better scores on various evaluation metrics than the models reported in papers of the past two years. Finally, this thesis collects and releases the Shutterstock dataset for image captioning, containing 7 million images with one caption each. The size of this dataset far exceeds that of MS COCO, Flickr30k, and other common datasets, and we also demonstrate its usability.
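The CBAM module attached to the ResNet encoder combines channel attention and spatial attention over a convolutional feature map. The following NumPy sketch illustrates that idea only; it is not the thesis code. The 7x7 convolution in CBAM's original spatial branch is reduced here to a per-pixel weighted sum of the two pooled maps for brevity, and all weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2, w_sp):
    """Apply CBAM-style channel then spatial attention to feat of shape (C, H, W).

    w1, w2: weights of the shared two-layer MLP in the channel branch.
    w_sp:   two scalars standing in for the spatial-branch convolution.
    """
    # Channel attention: average- and max-pool over spatial dims,
    # pass both through a shared MLP, sum, and squash to (0, 1).
    avg_c = feat.mean(axis=(1, 2))                  # (C,)
    max_c = feat.max(axis=(1, 2))                   # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # ReLU hidden layer
    ch_att = sigmoid(mlp(avg_c) + mlp(max_c))       # (C,)
    feat = feat * ch_att[:, None, None]
    # Spatial attention: pool over channels; the paper uses a 7x7 conv
    # here, simplified to a weighted sum of the two pooled maps.
    avg_s = feat.mean(axis=0)                       # (H, W)
    max_s = feat.max(axis=0)                        # (H, W)
    sp_att = sigmoid(w_sp[0] * avg_s + w_sp[1] * max_s)
    return feat * sp_att[None, :, :]

rng = np.random.default_rng(0)
C, H, W, hidden = 8, 4, 4, 4
out = cbam(rng.standard_normal((C, H, W)),
           rng.standard_normal((hidden, C)),
           rng.standard_normal((C, hidden)),
           rng.standard_normal(2))
print(out.shape)  # → (8, 4, 4)
```

Because both attention maps lie in (0, 1), the module rescales rather than replaces the feature map, so it can be dropped into a ResNet block without changing tensor shapes.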
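The double-layer LSTM decoder with attention over region features can be sketched as a single decoding step. This sketch assumes an Up-Down-style wiring (an attention LSTM whose hidden state queries the regions, feeding a language LSTM); the layer layout, names, and dimensions are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_step(x, h, c, W):
    """Minimal LSTM cell: W maps the concatenation [x; h] to the four gates."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def decode_step(word_emb, regions, state, params):
    """One step of a two-layer attention decoder; regions: (k, d) encoder features."""
    (h1, c1), (h2, c2) = state
    W1, W2, w_r, w_h, v = params
    # Layer 1 (attention LSTM): sees the word, the mean image feature, and h2.
    x1 = np.concatenate([word_emb, regions.mean(axis=0), h2])
    h1, c1 = lstm_step(x1, h1, c1, W1)
    # Additive attention over the k regions, queried by h1.
    alpha = softmax(np.tanh(regions @ w_r.T + w_h @ h1) @ v)   # (k,)
    ctx = alpha @ regions                                      # (d,)
    # Layer 2 (language LSTM): sees the attended context and h1.
    h2, c2 = lstm_step(np.concatenate([ctx, h1]), h2, c2, W2)
    return h2, ((h1, c1), (h2, c2)), alpha

rng = np.random.default_rng(0)
k, d, m, e, a = 7, 6, 5, 4, 3   # regions, feature dim, hidden, embedding, attention dims
params = (rng.standard_normal((4 * m, e + d + m + m)),  # W1
          rng.standard_normal((4 * m, d + m + m)),      # W2
          rng.standard_normal((a, d)),                  # w_r
          rng.standard_normal((a, m)),                  # w_h
          rng.standard_normal(a))                       # v
state = ((np.zeros(m), np.zeros(m)), (np.zeros(m), np.zeros(m)))
h2, state, alpha = decode_step(rng.standard_normal(e),
                               rng.standard_normal((k, d)), state, params)
print(h2.shape, round(float(alpha.sum()), 6))  # → (5,) 1.0
```

At each step, h2 would be projected to the vocabulary to score the next word; the attention weights alpha are what the visual analysis in the thesis inspects, since they indicate which image regions drove each generated word.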