
Image Description Method Based On Deep Learning

Posted on: 2020-07-26  Degree: Master  Type: Thesis
Country: China  Candidate: J Q Zhang
GTID: 2438330602452749  Subject: Computer application technology
Abstract/Summary:
Image captioning is the task of having a machine automatically translate an image into a sentence that humans can understand. It is a fundamental problem spanning computer vision, machine learning, and natural language processing, and it has been a hot topic in computer vision in recent years. A captioning system must not only identify the objects in an image but also understand their properties, positions, and relationships, and then convert this information into a grammatically structured sentence through natural language processing. Image captioning is of great significance for assisting people with visual impairments, early childhood education, automatic image labeling, and image retrieval. With the development of deep learning, deep-learning-based methods have become the most widely used and most effective approach to image captioning. The models proposed in this thesis are also based on deep learning. Two models are proposed: an image caption model based on GoogLeNet and a double-layer GRU, and a double-layer GRU image caption model integrating a spatial transformer network and ResNet. The contents of this study are as follows:

(1) The thesis introduces the background and significance of image captioning and reviews in detail the state of research and the main methods used in China and abroad, as well as the role of image captioning in applications.

(2) The thesis introduces the main techniques used in image captioning, including the principles and development history of convolutional neural networks and recurrent neural networks, together with the optimization algorithms and overfitting-prevention techniques frequently used in model training.

(3) The thesis proposes an image caption model based on GoogLeNet and a double-layer GRU (G-GRUs). In the "encoding" phase, GoogLeNet extracts the image features. In the "decoding" phase, the GRU is chosen for its simple structure and low computational complexity, and a double-layer GRU network is used to build the language model. The double-layer GRU structure retains word sequences better, which improves the accuracy and expressiveness of the generated sentences. The AdamW optimization algorithm is used in the training stage; compared with other optimization algorithms, AdamW is computationally efficient, converges quickly, and improves model performance. Experiments show that the G-GRUs model trains quickly and that the accuracy of the generated sentences is significantly improved.

(4) The thesis proposes a double-layer GRU image caption model integrating a spatial transformer network and ResNet. It builds on the G-GRUs model by replacing GoogLeNet in the image encoding stage with a combination of a spatial transformer network and ResNet. In the decoding stage, the double-layer GRU is still used to build the language generation model. During encoding, the input image is first passed to the spatial transformer network, which applies an affine transformation directly to the input image. This lets the model learn invariance to spatial transformations such as translation, scaling, and rotation at the very start of the pipeline, which mitigates the problem of image deformation during convolution and improves the spatial robustness of the whole model. The output of the spatial transformer network is then sent to the deeper ResNet to extract image features, which makes the extracted features more accurate and more descriptive of the visual content. Finally, the extracted features are sent to the double-layer GRU to generate the corresponding description of the image. Experiments show that this model scores higher than the G-GRUs model on every evaluation metric, and that the generated captions are more vivid and diverse and closer to natural human language.
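
The double-layer GRU decoder described in (3) and (4) can be illustrated with a minimal sketch in plain Python. This is not the thesis implementation: the abstract does not give layer sizes, initialization, or how image features are fed to the decoder, so the names `GRUCell` and `two_layer_gru`, the random weights, and the stand-in input vectors are all assumptions made for illustration. The sketch uses one common GRU convention, h' = (1 - z) * h + z * n, and shows only how the hidden state of the first layer becomes the input of the second at each time step.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """A single GRU cell (hypothetical helper, randomly initialized)."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            gate: (mat(hidden_size, input_size),
                   mat(hidden_size, hidden_size),
                   [0.0] * hidden_size)
            for gate in "zrn"
        }

    def _gate(self, name, x, h, activation):
        W, U, b = self.params[name]
        return [activation(a + c + d)
                for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = self._gate("z", x, h, sigmoid)       # update gate
        r = self._gate("r", x, h, sigmoid)       # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = self._gate("n", x, reset_h, math.tanh)  # candidate state
        # Blend the old state and the candidate state.
        return [(1.0 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]


def two_layer_gru(inputs, input_size, hidden_size, seed=0):
    """Run a sequence through two stacked GRU cells.

    `inputs` stands in for the per-step decoder inputs (for example word
    embeddings); how image features enter the decoder is not specified
    in the abstract and is left out here.
    """
    rng = random.Random(seed)
    layer1 = GRUCell(input_size, hidden_size, rng)
    layer2 = GRUCell(hidden_size, hidden_size, rng)
    h1 = [0.0] * hidden_size
    h2 = [0.0] * hidden_size
    outputs = []
    for x in inputs:
        h1 = layer1.step(x, h1)   # first layer reads the input
        h2 = layer2.step(h1, h2)  # second layer reads the first layer's state
        outputs.append(h2)
    return outputs


if __name__ == "__main__":
    sequence = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
    for step_output in two_layer_gru(sequence, input_size=3, hidden_size=4):
        print(step_output)
```

In a full captioning model, each second-layer state would typically be projected onto the vocabulary to predict the next word; that projection and the training loop are omitted from this sketch.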
Keywords/Search Tags: image caption, GRU, spatial transformer networks, ResNet, AdamW optimization algorithm