
Collaborating General And Specific Semantics For Multi-feature Based Image Captioning

Posted on: 2020-12-25  Degree: Master  Type: Thesis
Country: China  Candidate: H Liu  Full Text: PDF
GTID: 2428330602452521  Subject: Signal and Information Processing
Abstract/Summary:
With the dramatic increase in Internet bandwidth and the proliferation of mobile devices, image data is generated, released, and spread rapidly under Web 2.0 technology and has become an indispensable part of today's big data. However, many images on the Internet are untagged. To store, manage, retrieve, and utilize these data more efficiently, researchers have in recent years worked on automatically describing image content with complete sentences, i.e., image captioning. Image captioning is very challenging: it must capture the visual representation of the objects and scenes presented in an image and express the relationships between them, and it must also describe them in appropriate natural language. To address these problems, we systematically study deep-learning-based image captioning. The main research contributions are as follows:

(1) We propose an LSTM-based image captioning framework that generates a sentence sequence from a multi-feature sequence. To describe image features more comprehensively, we train one ResNet-152 on the ImageNet dataset to extract object features and another ResNet-152 on the Places365 dataset to extract scene context features. These two complementary features jointly represent the objects and the scene context in an image. In addition, we use multi-instance attribute classifiers trained on the MSCOCO dataset to extract semantic information from the image as a supplement of general semantic priors for captioning. We feed the object features, scene context features, and visual semantics sequentially into the LSTM encoder to complete the feature representation of the image. Finally, an LSTM decoder translates the features into a language description; the framework is trained to translate multi-feature sequences into natural-language sequences under a cross-entropy loss. We evaluate our model on the MSCOCO dataset, and the comparison results show the superiority of our algorithm over state-of-the-art approaches on standard evaluation metrics.

(2) We propose a multi-feature-based image captioning framework that collaborates general and specific semantics. To better represent the semantic features of images, we extract the general semantic attributes of an image with the multi-instance attribute classifier trained on the MSCOCO dataset, and then retrieve similar semantics for the test image in an improved visual semantic embedding (VSE++) space as the image's specific semantic attributes. We then collaborate the general and specific semantic attributes as semantic priors and sequentially feed the collaborated semantic attributes, object features, and scene context features into the LSTM encoder as the feature representation of the image. In addition, we employ the specific semantics as a "specific semantic supervisor" that applies BLEU-4 similarity supervision to the candidate phrases during LSTM decoding, which yields a captioning method that collaborates specific semantic supervision with general semantics. Evaluation on the MSCOCO dataset shows the superiority of our model, which achieves better experimental results than state-of-the-art approaches.
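To make the encoder-decoder in contribution (1) concrete, the following PyTorch-style sketch feeds object, scene-context, and semantic features sequentially into an LSTM before decoding words. It is a minimal sketch, not the thesis implementation: the module names, projection layers, dimensions, and vocabulary size are all illustrative assumptions, and the ResNet-152 (ImageNet/Places365) features and multi-instance attribute scores are assumed to be precomputed.

```python
# Minimal sketch of the multi-feature LSTM encoder-decoder in (1).
# All dimensions and layer names are hypothetical placeholders.
import torch
import torch.nn as nn

class MultiFeatureCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, attr_dim=1000, hidden=512, vocab=10000):
        super().__init__()
        # Project each feature type into a common embedding space.
        self.obj_proj = nn.Linear(feat_dim, hidden)    # ImageNet ResNet-152 features
        self.scene_proj = nn.Linear(feat_dim, hidden)  # Places365 ResNet-152 features
        self.attr_proj = nn.Linear(attr_dim, hidden)   # multi-instance attribute scores
        self.word_emb = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, obj_feat, scene_feat, attrs, captions):
        # Encode: feed object, scene-context, and semantic features
        # sequentially into the LSTM, as the abstract describes.
        feats = torch.stack([self.obj_proj(obj_feat),
                             self.scene_proj(scene_feat),
                             self.attr_proj(attrs)], dim=1)
        _, state = self.lstm(feats)
        # Decode with teacher forcing: predict the next word at each step.
        h, _ = self.lstm(self.word_emb(captions), state)
        return self.out(h)  # (batch, time, vocab) logits
```

Training would then minimize `nn.CrossEntropyLoss()` between these logits and the ground-truth next words, matching the cross-entropy objective named in the abstract.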
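The collaboration step in contribution (2) can be pictured as retrieving nearest neighbors of the test image in a VSE++-style joint embedding space and merging their attributes with the classifier's general attributes. The sketch below is an assumption about one plausible fusion rule (a convex combination); the data layout, neighbor count, and fusion weight are illustrative, and the thesis's actual collaboration scheme may differ.

```python
# Illustrative sketch of collaborating general and specific semantics in (2).
import torch
import torch.nn.functional as F

def collaborate_semantics(img_emb, nbr_embs, nbr_attrs, gen_attrs, k=5, alpha=0.5):
    """Merge general and retrieved specific semantic attributes.

    img_emb:   (d,)   test image in the joint VSE++-style embedding space
    nbr_embs:  (N, d) embeddings of the retrieval pool
    nbr_attrs: (N, a) attribute vectors attached to pool entries
    gen_attrs: (a,)   general attributes from the multi-instance classifier
    """
    # VSE++ ranks by similarity in the joint space; cosine is used here.
    sims = F.cosine_similarity(img_emb.unsqueeze(0), nbr_embs, dim=1)
    topk = sims.topk(k).indices
    # Specific semantics: average the retrieved neighbors' attributes.
    spec_attrs = nbr_attrs[topk].mean(dim=0)
    # Collaborate the two priors (hypothetical convex combination).
    return alpha * gen_attrs + (1 - alpha) * spec_attrs
```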
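One way to read the "specific semantic supervisor" in (2) is as BLEU-4 re-scoring of candidate phrases during decoding against the retrieved specific sentences. The sketch below shows that reading only; the re-ranking function, weight, and data shapes are assumptions, not the thesis's stated procedure.

```python
# Hypothetical BLEU-4 supervision over decoder candidates, per (2).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rerank_candidates(candidates, retrieved_sents, weight=0.2):
    """Re-rank beam candidates by BLEU-4 similarity to retrieved sentences.

    candidates:      list of (token_list, log_prob) pairs from the LSTM decoder
    retrieved_sents: tokenized sentences retrieved as specific semantics
    """
    smooth = SmoothingFunction().method1  # smooths zero n-gram counts
    scored = []
    for tokens, log_prob in candidates:
        # BLEU-4 of the candidate against all retrieved references.
        bleu = sentence_bleu(retrieved_sents, tokens, smoothing_function=smooth)
        scored.append((tokens, log_prob + weight * bleu))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```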
Keywords/Search Tags: Image captioning, convolutional neural network, long short-term memory, cross-modal retrieval, general semantics, specific semantics