
Research On Image Description Based On Multimodal Recurrent Network

Posted on: 2019-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Shu
Full Text: PDF
GTID: 2438330551460792
Subject: Computer application technology
Abstract/Summary:
Visual information plays an important role in the information that humans acquire. With improvements in digital imaging and large-capacity storage, digital images have become the most important carrier of visual information, and describing images with natural-language sentences has attracted increasing attention. Image captioning must not only identify the objects in an image but also describe the properties of those objects and the relationships among them, so a sentence description of an image can carry rich information. Earlier work on image description relied on methods such as template matching and similarity-based retrieval; recently, deep neural networks have become the leading approach, and many novel captioning models have been proposed. The research in this thesis is based on the multimodal recurrent neural network and comprises the following two parts:

1) A bidirectional multimodal recurrent neural network is proposed. Unfolding the traditional multimodal recurrent neural network over time steps shows that each generated word depends only on the words before it. However, every word in a sentence is related not only to the preceding words but also to the words that follow. The proposed bidirectional multimodal recurrent network is trained on bidirectional sentence sequences, and the final description is chosen according to the loss function. Experimental results demonstrate that the improved model achieves better performance.

2) Model performance is further improved with spatial and textual features. In the baseline model, image features are fed directly into the multimodal recurrent neural network at each time step. Instead, different weights can be assigned to each region of the image to model differences in attention, and the image features can be combined with textual features at each time step so that the image features become time-dependent. Feature fusion further improves the results, and the sentences generated by the improved models show that the proposed methods are effective.
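The two ideas above can be illustrated with a minimal, untrained NumPy sketch: spatial attention assigns a weight to each image region and pools them into a context vector, the multimodal step fuses that context with the word embedding and hidden state at every time step, and running the same unrolling over the reversed word sequence gives the backward pass of the bidirectional variant. All names, dimensions, and randomly initialised weights here are hypothetical placeholders; the actual models in the thesis use learned CNN region features and trained recurrent cells.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only)
n_regions, d_img = 4, 8      # image split into 4 regions of 8-dim features
d_embed, d_hidden = 6, 10    # word-embedding and hidden-state sizes
vocab = ["<s>", "a", "dog", "runs", "</s>"]

# Randomly initialised parameters stand in for trained weights
W_att = rng.normal(size=(d_img,))            # attention scorer
W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden
W_e = rng.normal(size=(d_hidden, d_embed))   # embedding-to-hidden
W_v = rng.normal(size=(d_hidden, d_img))     # image-context-to-hidden
E = rng.normal(size=(len(vocab), d_embed))   # word embeddings

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attend(regions):
    """Spatial attention: weight each region, then pool into a context."""
    alpha = softmax(regions @ W_att)   # one weight per region, sums to 1
    return alpha, alpha @ regions      # weighted sum = context vector

def multimodal_step(h_prev, word_id, context):
    """Fuse hidden state, word embedding, and image context each step,
    so the image signal entering the recurrence is time-dependent."""
    return np.tanh(W_h @ h_prev + W_e @ E[word_id] + W_v @ context)

def run_sequence(word_ids, regions):
    """Unroll over a word sequence; reversing word_ids gives the
    backward direction of the bidirectional variant."""
    h = np.zeros(d_hidden)
    alpha, ctx = attend(regions)
    for w in word_ids:
        h = multimodal_step(h, w, ctx)
    return h, alpha

regions = rng.normal(size=(n_regions, d_img))
seq = [0, 1, 2, 3, 4]                         # "<s> a dog runs </s>"
h_fwd, alpha = run_sequence(seq, regions)     # forward direction
h_bwd, _ = run_sequence(seq[::-1], regions)   # backward direction
```

In a full model, both directions would be trained on the same caption, and the final description would be the one whose direction yields the lower loss, mirroring the selection rule described above.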
Keywords/Search Tags: image captioning, multimodal recurrent neural network, bidirectional sentence sequence, spatial features, textual features