
Research On Visual Description Technology Based On Deep Learning

Posted on: 2021-02-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W B Che
Full Text: PDF
GTID: 1488306569483394
Subject: Computer application technology
Abstract/Summary:
Along with the development of big data techniques in the multimedia age, images carried over the Internet and portable devices have become an important way to convey information. To make proper use of this visual information, understanding image content continues to draw attention from researchers. Over the past few years, the rapid development of deep learning theory has brought significant success to classical computer vision tasks such as image classification, object detection and recognition. Meanwhile, natural language processing tasks such as machine translation and human-computer interaction have also been greatly advanced. Inspired by these achievements, researchers have begun to focus on the more challenging visual description task, which presents image content in natural language. Visual description gives rise to two sub-tasks: the image caption task and the visual question answering (VQA) task. The VQA task infers an answer for an image-question pair, while the image caption task describes a given image in natural language. The former describes images in a passive manner, since a question is required as one of the inputs; the latter works in an active way, generating as much visual information as possible. Since both tasks involve multi-modal data distributions, it is difficult to build a direct mapping between image and language signals, and deep learning models have become one of the important solutions due to their powerful feature representation ability. An image usually contains the semantics of visual elements and their relationships, while text can be represented by letters, words and phrases. As deep learning theory becomes increasingly essential in multi-modal tasks, research on visual description is growing broader and deeper.

In recent years, the application of deep learning techniques has kept improving both VQA and image caption models, but the difficulties have also become obvious. In VQA, most deep learning based models must handle various types of questions, and counting-type questions are particularly challenging. For image captioning, a single-sentence caption is far from capable of describing a complicated scene, so researchers have turned their attention to multi-sentence captioning, especially paragraph generation, which is more challenging than single-sentence captioning. In this dissertation, the research on both VQA and image captioning is extended using deep learning models: the difficulty of counting-type questions is analyzed and addressed with a novel strategy, and for image captioning we aim to produce paragraphs that carry richer information in a more concise form. The contributions are fourfold.

First, we propose a region detection based question counting model. Traditional counting methods can only learn features of a specific target using either detection or regression strategies and cannot return the counts of objects queried by users, while recently proposed VQA models treat all types of questions as a regression task, which does not handle counting-type questions well. To solve this problem, we design a localization + regression framework that identifies the number of objects described by the input question. At the localization stage, we extract sequential features of the question and fuse them with image features to predict region coordinates. At the regression stage, we extract character-level features of the question, match them with the predicted regions via a discrimination function, and compute the object count for each qualified region through a regression network. Unlike traditional counting methods, our model adapts to different kinds of targets by exploiting the question, and compared with most VQA models it predicts more accurate counts.
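As a rough illustration of this two-stage idea, the PyTorch-style sketch below fuses word-level question features with image features to predict region coordinates, then matches character-level question features to each region and regresses a per-region count. All module choices (the LSTM word encoder, the character-level CNN, a bilinear layer standing in for the discrimination function) and all tensor dimensions are assumptions made for readability, not the implementation described above.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module names, dimensions and the bilinear matching
# function are assumptions, not the dissertation's implementation.
class QuestionGuidedCounter(nn.Module):
    def __init__(self, vocab_size, char_dim=64, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, num_regions=10):
        super().__init__()
        # Localization stage: word-level (sequential) question features + image features.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.loc_fuse = nn.Linear(hidden_dim + img_feat_dim, hidden_dim)
        self.box_head = nn.Linear(hidden_dim, num_regions * 4)       # (x, y, w, h) per region
        # Regression stage: character-level question features matched against regions.
        self.char_cnn = nn.Sequential(
            nn.Conv1d(char_dim, hidden_dim, kernel_size=3, padding=1),
            nn.AdaptiveMaxPool1d(1))
        self.match_score = nn.Bilinear(hidden_dim, img_feat_dim, 1)  # "discrimination function"
        self.count_head = nn.Linear(img_feat_dim, 1)                 # per-region count regression

    def forward(self, word_ids, char_feats, global_img_feat, region_feats):
        # word_ids: (B, T)  char_feats: (B, char_dim, L)
        # global_img_feat: (B, img_feat_dim)  region_feats: (B, K, img_feat_dim)
        _, (h, _) = self.question_rnn(self.word_embed(word_ids))
        q_word = h[-1]                                                # (B, hidden_dim)
        fused = torch.relu(self.loc_fuse(torch.cat([q_word, global_img_feat], dim=-1)))
        boxes = self.box_head(fused).view(-1, region_feats.size(1), 4)

        q_char = self.char_cnn(char_feats).squeeze(-1)                # (B, hidden_dim)
        q_char = q_char.unsqueeze(1).expand(-1, region_feats.size(1), -1).contiguous()
        match = torch.sigmoid(self.match_score(q_char, region_feats)).squeeze(-1)  # (B, K)
        counts = self.count_head(region_feats).squeeze(-1)            # (B, K)
        total = (match * counts).sum(dim=1)                           # question-conditioned count
        return boxes, total
```

The sketch returns both the predicted boxes and a question-conditioned total count, so the localization and regression stages can be supervised jointly; how the two losses are balanced is left open here.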
Second, we propose a generative adversarial network (GAN) based model for counting-type questions. The regions are first located by the GAN detectors, with their coordinates inferred from the fused features of images and questions. To exploit question semantics at different levels, we use an RNN and a CNN to extract text features in the generator and the discriminator respectively, and fuse them with image features computed by a pre-trained CNN; a regression function then computes the count for each region. The experimental results show that the GAN framework helps locate targets and remove redundant regions, which improves the prediction of object counts.

Third, we propose a paragraph generation network based on relationship prediction. Traditional image caption models usually use a CNN to extract visual feature maps and an RNN to transform these features into sentences. Such architectures effectively produce descriptions for simple scenes but cannot handle complicated situations that require detailed descriptions. One solution is to detect critical regions in the image and produce a sentence for each region, but this cannot capture relationships between objects effectively and therefore loses a large amount of visual information. To address this, we use visual relationships to enhance paragraph generation: unlike traditional paragraph generation models, we explicitly predict the visual relationships among visual elements and then fuse their features with image features through an attention mechanism. The experimental results demonstrate that the proposed model produces more accurate paragraphs with richer information than previous methods.

Fourth, we propose a GAN based model for paragraph generation. We show that improving region prediction as well as visual relationship prediction is critical to paragraph generation performance. Paragraph generation models typically locate critical regions with a region proposal network originally designed for object detection, so the quality of the generated language depends heavily on region localization. We observe that regions carrying more visual information tend to produce better descriptions, which differs from object detection models that focus only on individual objects. Based on this observation, we design a GAN based paragraph generation network that generates sentences covering more visual relations: the generator is a region proposal network that predicts critical regions, and the discriminator optimizes the generator's parameters so that the proposed regions cover visual relationships well. The experimental results show that the GAN based region proposal network effectively improves visual relationship prediction, which in turn improves paragraph generation performance.
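As a rough illustration of this adversarial idea, the toy sketch below pairs a region-scoring generator with a discriminator that prefers region features covering visual relationships. The soft region selection, the losses and all shapes are simplifying assumptions; the actual model uses a full region proposal network rather than this reduction.

```python
import torch
import torch.nn as nn

# Toy adversarial sketch: a region-scoring generator is pushed by a discriminator
# toward regions that cover visual relationships. Architectures, losses and shapes
# are illustrative assumptions only.
class RegionGenerator(nn.Module):
    """Scores candidate regions from their pooled features (stand-in for an RPN)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, region_feats):               # (B, K, D) -> (B, K) selection scores
        return self.scorer(region_feats).squeeze(-1)

class RelationDiscriminator(nn.Module):
    """Judges whether a region feature covers a visual relationship."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, region_feats):               # (..., D) -> (...) realness logits
        return self.net(region_feats).squeeze(-1)

def adversarial_step(gen, disc, region_feats, rel_feats, g_opt, d_opt):
    """One training step; rel_feats are features of regions known to contain relationships."""
    bce = nn.BCEWithLogitsLoss()
    # Generator softly selects a region from its scores.
    weights = torch.softmax(gen(region_feats), dim=1).unsqueeze(-1)
    picked = (weights * region_feats).sum(dim=1, keepdim=True)       # (B, 1, D)
    # Discriminator: relationship-covering regions are "real", generator picks are "fake".
    real_logits = disc(rel_feats)
    fake_logits = disc(picked.detach())
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator: make its selected regions look relationship-covering to the discriminator.
    gen_logits = disc(picked)
    g_loss = bce(gen_logits, torch.ones_like(gen_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

In the full model the generator would be the region proposal network itself and region features would be pooled from a shared feature map; the sketch only shows how an adversarial objective can bias region selection toward relationship coverage.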
Keywords/Search Tags:visual description, image caption, paragraph generation, visual question answering, visual relationship, generative adversarial network, deep learning