
Semantic Describing And Understanding For Imagery Content

Posted on: 2019-04-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A H Yuan
GTID: 1368330596956545
Subject: Signal and Information Processing
Abstract/Summary:
Imagery content learning and understanding is an important part of artificial intelligence. On the one hand, taking imagery as the research object, a computer system uses programs and artificial intelligence methods to automatically extract the content an image expresses: the types of objects it contains, the attributes of those objects, and the relationships between them. On the other hand, natural language is the primary tool of human communication and an important hallmark of intelligence. We therefore hope that computers can not only learn and understand the content of imagery, but also describe it, reason about it, and answer questions about it in natural language, just as humans do. This is consistent with a long-standing dream of the artificial intelligence field: to let computers understand the rich visual world around them and communicate with us in natural language. Imagery content learning and understanding has thus become an important interdisciplinary subject spanning artificial intelligence and computer vision, and it has been widely studied. This dissertation addresses this topic and focuses on two subtasks that connect imagery content with human language: image captioning and visual question answering. Solving them is a necessary step toward real artificial intelligence. The main contents and contributions are summarized as follows:

(1) Multi-modal gated recurrent unit for image caption generation. Traditional, non-deep-learning captioning algorithms have several shortcomings: 1) the descriptions they generate have a fixed length, 2) the sentences lack variety, and 3) the sentences often fail to describe the images accurately. The proposed method instead uses deep neural networks. First, the image is encoded by a deep convolutional neural network, which extracts discriminative and expressive global image features. A gated recurrent unit (GRU) then serves as both the multi-modal embedding and the sentence-generation module. This model not only generates variable-length, stylistically rich natural sentences, but also fully exploits the multi-modal mapping between natural language and images. Furthermore, to better model the nonlinear relationship between images and natural language, we increase the depth of the recurrent units. The algorithm is validated on the three main image captioning datasets, and the experimental results show that it realizes the "translation" from image to text well (see the first sketch below).

(2) 3G structure for image captioning. Captioning algorithms based on global image features can only learn the multi-modal mapping between the entire image and the entire description, which is clearly too coarse; we also need the fine-grained mapping between local image regions and the elements of natural language. The proposed algorithm mines this correspondence with a visual attention mechanism: at each time-step, the word just generated is used to select the relevant image regions. Existing attention-based methods, however, use only local image features and abandon the global features, even though the global features carry image-level information that is an important complement to the local ones; local features also suffer from the problem of object scaling. The proposed algorithm therefore fuses the global and local image features through a multi-modal fusion module. In addition, to further improve the multi-modal fusion and language modules of existing methods, we use a gated-feedback strategy to increase the depth of the long short-term memory (LSTM) network. The experimental results show that the proposed algorithm performs well on the image captioning task (see the second sketch below).

(3) Vision-to-language tasks based on attributes and an attention mechanism. To alleviate the cross-modal semantic gap between images and natural language, the proposed algorithm uses image attribute information as a "bridge" between image and language. The algorithm consists mainly of a two-level attention network: a semantic-guided attention network, which highlights the regions related to the image attributes and the attributes related to the image regions, and a text-guided attention network, which finds the mapping between natural sentences and image parts. The algorithm has two branches, used for image captioning and visual question answering respectively, and experiments are performed on both types of datasets. The results show that the algorithm improves the accuracy of both image captioning and visual question answering (see the third sketch below).
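First sketch: a minimal illustration of the encoder-decoder design in contribution (1), in which a pretrained CNN supplies a global image feature that conditions a stacked (deepened) GRU sentence generator. It assumes PyTorch and torchvision are available; the ResNet-50 backbone, layer sizes, and vocabulary handling are illustrative assumptions, not the dissertation's actual configuration.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class GRUCaptioner(nn.Module):
        def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_layers=2):
            super().__init__()
            # Pretrained CNN encoder; the final pooling layer yields a global feature.
            cnn = models.resnet50(weights="DEFAULT")
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # (B, 2048, 1, 1)
            self.img_proj = nn.Linear(2048, embed_dim)  # map the image into the word space
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Stacked GRU: "increasing the depth of the recurrent units" is modeled
            # here, loosely, as num_layers > 1.
            self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing)
            feats = self.encoder(images).flatten(1)       # (B, 2048) global feature
            img_emb = self.img_proj(feats).unsqueeze(1)   # (B, 1, E)
            word_emb = self.embed(captions)               # (B, T, E)
            # Prepend the image as a pseudo-token so the GRU conditions on it.
            states, _ = self.gru(torch.cat([img_emb, word_emb], dim=1))
            return self.out(states)                       # (B, T+1, vocab) word logits

At inference time one would feed the image once and then sample words autoregressively until an end-of-sentence token, as in standard encoder-decoder captioners.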
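Second sketch: contribution (2) centers on attending over local region features while retaining the global feature as a complement. The module below sketches one such decoding step with standard additive (soft) attention and a simple concatenation-based fusion; the exact attention form, the gated-feedback LSTM, and all dimensions are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class GlobalLocalAttention(nn.Module):
        """One decoding step: attend over local region features, then fuse the
        attended context with the global image feature."""
        def __init__(self, region_dim=2048, global_dim=2048, hidden_dim=512):
            super().__init__()
            self.att_region = nn.Linear(region_dim, hidden_dim)
            self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
            self.att_score = nn.Linear(hidden_dim, 1)
            # Multi-modal fusion of the local context and the global feature.
            self.fuse = nn.Linear(region_dim + global_dim, hidden_dim)

        def forward(self, regions, global_feat, h):
            # regions: (B, R, region_dim) local features (e.g. CNN feature-map cells)
            # global_feat: (B, global_dim); h: (B, hidden_dim) current decoder state
            scores = self.att_score(torch.tanh(
                self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
            alpha = torch.softmax(scores, dim=1)           # attention over regions
            context = (alpha * regions).sum(dim=1)         # (B, region_dim)
            fused = torch.tanh(self.fuse(torch.cat([context, global_feat], dim=1)))
            return fused, alpha.squeeze(-1)                # fused feature + weights

Keeping the global feature in the fusion step is what distinguishes this sketch from purely local attention: the attended context handles fine-grained grounding while the global feature preserves scene-level information.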
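Third sketch: the two-level attention idea of contribution (3), as a semantic-guided stage that scores regions against attribute embeddings, followed by a text-guided stage that weights the highlighted regions by a sentence or question encoding. The scoring functions and all dimensions are illustrative guesses; the dissertation's actual networks may differ.

    import torch
    import torch.nn as nn

    class TwoLevelAttention(nn.Module):
        """Semantic-guided attention (attributes <-> regions), then
        text-guided attention over the highlighted regions."""
        def __init__(self, region_dim=2048, attr_dim=300, text_dim=512, hidden=512):
            super().__init__()
            self.region_proj = nn.Linear(region_dim, hidden)
            self.attr_proj = nn.Linear(attr_dim, hidden)
            self.text_proj = nn.Linear(text_dim, hidden)

        def forward(self, regions, attrs, text):
            # regions: (B, R, D_r) local features; attrs: (B, A, D_a) embeddings of
            # predicted attribute words; text: (B, D_t) sentence/question encoding
            R = self.region_proj(regions)                 # (B, R, H)
            A = self.attr_proj(attrs)                     # (B, A, H)
            q = self.text_proj(text).unsqueeze(2)         # (B, H, 1)
            # Stage 1 (semantic-guided): score every region against every attribute,
            # then average over attributes to highlight attribute-relevant regions.
            sem = torch.softmax(torch.bmm(R, A.transpose(1, 2)), dim=1).mean(2)  # (B, R)
            R_sem = R * sem.unsqueeze(2)                  # attribute-highlighted regions
            # Stage 2 (text-guided): weight the highlighted regions by the text query.
            txt = torch.softmax(torch.bmm(R_sem, q).squeeze(2), dim=1)           # (B, R)
            return (txt.unsqueeze(2) * R_sem).sum(dim=1)  # (B, H) fused visual context

The returned context vector could then feed either branch of the model: a caption decoder or a VQA answer classifier.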
Keywords/Search Tags: Image Captioning, Visual Question Answering, Multi-modal Learning, Recurrent Neural Network