
Deep Learning For Image Captioning

Posted on: 2019-05-03    Degree: Master    Type: Thesis
Country: China    Candidate: S Liu    Full Text: PDF
GTID: 2428330611493339    Subject: Systems Engineering
Abstract/Summary:
With the development of network communication and multimedia technology, the ways in which people acquire knowledge and communicate with each other are undergoing profound changes, and ever more multimedia information, such as text, images, and videos, pours into people's view. Image captioning is a multi-modal technique that bridges image and text: it combines two key areas, computer vision and natural language processing, to translate an image into a textual description. It has many applications, such as image retrieval and network image analysis.

This paper adopts an encoder-decoder framework that automatically generates a description for a given picture by learning the characteristics of the images and sentences in the data set. The model involves two kinds of deep neural networks, CNNs and RNNs, which have been widely used in machine learning in recent years. This paper proposes an adaptive attention mechanism based on text traction. The structure is applied to CNN-RNN and CNN-CNN model frameworks respectively, so that the model can reason like a human and dynamically attend to different image regions when generating the related words. The work and research results of this paper mainly include the following aspects:

(1) For the task of image captioning, this paper proposes a method that finds a representative text-guided feature for a given image, in order to overcome the heterogeneity of low-level features between images and text. Given a query image, the text traction vector is obtained through a series of operations: finding the nearest-neighbor images, selecting a "consensus sentence", and mapping its features. The text traction vector serves as a bridge between the image and text modalities throughout the caption-generation process.

(2) This paper designs a CNN-RNN framework based on the text-traction attention mechanism. The description of an image depends on both visual information and a language model. In this paper, the text-guided vector is merged into the attention mechanism, so that the decoder can adaptively adjust the region of visual concentration, thereby generating more natural descriptions and effectively improving the experimental results.

(3) This paper designs a CNN-CNN framework based on the text-traction attention mechanism. The parallel computation of CNN models in deep learning frameworks, together with GPU acceleration, enables the CNN decoder to stack multiple network layers instead of using recurrent paths to memorize context information. The experiments analyze the influence of the number of layers and the kernel size, as well as the quality of the generated descriptions and the training and test times of the two model architectures.
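The three steps behind the text traction vector (nearest-neighbor images, "consensus sentence" selection, feature mapping) can be sketched roughly as follows. This is a minimal NumPy sketch under assumed representations: image features and caption embeddings as fixed-length vectors, cosine similarity as the matching criterion, and a fixed projection for the mapping step. The function names and these choices are illustrative, not the thesis's exact procedure.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between vector a and each row of matrix b.
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a) + 1e-8)

def text_traction_vector(query_feat, db_feats, db_caption_embs, k=3):
    """Sketch of the three steps: nearest-neighbour images,
    'consensus sentence' selection, and feature mapping."""
    # 1) Nearest-neighbour images of the query in visual feature space.
    nn_idx = np.argsort(-cosine_sim(query_feat, db_feats))[:k]
    # 2) Consensus sentence: among the neighbours' caption embeddings,
    #    pick the one closest to their mean (a common consensus criterion).
    cand = db_caption_embs[nn_idx]
    centroid = cand.mean(axis=0)
    consensus = cand[np.argmax(cosine_sim(centroid, cand))]
    # 3) Feature mapping: project the sentence embedding into the space
    #    used by the attention module (here a fixed random projection).
    W = np.random.default_rng(0).standard_normal((consensus.size, consensus.size))
    return np.tanh(W @ consensus)
```

In a trained model the projection `W` would be learned jointly with the captioner; a random matrix stands in for it here only to keep the sketch self-contained.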
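Merging the text-guided vector into the attention step of the CNN-RNN decoder, as in contribution (2), might look like the following sketch. The additive scoring form and the weight shapes are assumptions for illustration; they are not the thesis's exact equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_guided_attention(regions, h, g, Wv, Wh, Wg, w):
    """regions: (R, d) CNN features for R image regions,
    h: (d,) RNN decoder hidden state, g: (d,) text traction vector.
    Returns the attended context vector and the attention weights."""
    # Additive attention score per region, conditioned on both the
    # decoder state h and the text-guidance vector g.
    scores = np.tanh(regions @ Wv + h @ Wh + g @ Wg) @ w   # (R,)
    alpha = softmax(scores)                                # weights sum to 1
    context = alpha @ regions                              # (d,) weighted sum
    return context, alpha
```

At each decoding step the RNN would consume `context` together with the previous word, so the visual focus shifts word by word under the guidance of `g`.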
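The CNN decoder in contribution (3) replaces recurrence with stacked convolutions whose receptive field over past words grows with depth and kernel size. Below is a minimal sketch of such a causal 1-D convolution stack; the left-padding trick and the ReLU nonlinearity are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

def causal_conv1d(x, W):
    """x: (T, d_in) word representations; W: (k, d_in, d_out).
    Left-pads so that position t only sees positions <= t (no future words)."""
    k = W.shape[0]
    xp = np.vstack([np.zeros((k - 1, x.shape[1])), x])   # pad k-1 past steps
    out = np.stack([
        sum(xp[t + j] @ W[j] for j in range(k))
        for t in range(x.shape[0])
    ])
    return np.maximum(out, 0.0)  # ReLU

def cnn_decoder(x, layers):
    # Stacking n layers of kernel size k gives a receptive field of
    # n * (k - 1) + 1 past tokens, in place of an RNN's recurrent path.
    for W in layers:
        x = causal_conv1d(x, W)
    return x
```

Because every position in every layer can be computed independently, the whole sequence is processed in parallel on a GPU, which is the training-speed advantage the abstract attributes to the CNN-CNN architecture.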
Keywords/Search Tags:text-guided, deep learning, recurrent neural networks, convolutional neural networks, attention mechanism