
Research On Key Technologies Of Deep Representation Based Visual Understanding

Posted on: 2019-11-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J W Wang    Full Text: PDF
GTID: 1368330566987084    Subject: Computer Science and Technology
Abstract/Summary:
Visual information is the major source of information in human perception. A key goal of artificial intelligence research is therefore to enable computers to learn how to "see" the surrounding world and to acquire the knowledge that humans need. Research on visual understanding aims to enable computers to extract the underlying knowledge from visual signals. In recent years, visual data such as images and videos has been growing faster than ever before, and deep learning has further advanced computer vision, making it possible to analyze this large volume of visual data automatically. Visual representations based on deep learning (deep representations) achieve impressive performance on many computer vision tasks. Focusing on deep-representation-based visual understanding, this thesis studies the following problems.

(1) Image sentiment recognition, which involves the semantic category information in visual understanding, analyzes an image and infers the underlying human sentiment. It faces several challenges: large intra-class variance, fine-grained recognition, and scalability. The first is attributed to the fact that different objects/scenes may convey similar sentiment; the second to the fact that the same objects/scenes may convey different sentiments; and the last to the fact that labels of adjective-noun pairs may be missing.

(2) Temporal action localization involves the boundary information of semantic objects in visual understanding. It aims to temporally localize all actions that occur in a video and to output their starting and ending times. Existing methods either use sliding windows, which constrain the duration of detected events, or use a unidirectional single-stream approach, which neglects future context.

(3) Dense video captioning, which involves the semantic description information in visual understanding, describes the detected actions or events in natural language on top of temporal action localization. Existing methods cannot distinguish highly overlapping events, and the learned event representations have poor discriminative ability.

Regarding the issues above, this thesis conducts a comprehensive investigation. The contributions are summarized as follows.

1. This thesis proposes an end-to-end method for image sentiment recognition. The method constructs middle-level semantic representations to bridge the so-called "semantic gap"; these representations are learned under the supervision of decomposed adjectives and nouns (a minimal architectural sketch follows this list). Compared with previous methods, it simultaneously addresses the three challenges above: large intra-class variance, fine-grained recognition, and scalability. Extensive experiments on the SentiBank and Twitter datasets demonstrate the superiority of the proposed method.

2. This thesis proposes a bidirectional method for temporal action localization. Through a forward pass, a backward pass, and a fusion step, the method exploits both past and future contexts, so each proposal enjoys complete context information, whereas the baseline exploits no future context (a fusion sketch follows the list). The proposed method surpasses all competing methods and achieves new state-of-the-art results on the THUMOS-14 and ActivityNet Captions datasets.
3. This thesis proposes a novel mechanism that fuses event content with its context for dense video captioning. The mechanism has two key components: bidirectional attentive fusion and context gating. The former dynamically fuses the detected events and their contexts, enabling the network to find the key video frames when learning to describe an event. The latter dynamically weights the detected events and their contexts through the designed gating functions, learning to exploit context when generating the next word of a caption (a gating sketch follows the list). Experiments show that the proposed mechanism effectively improves the discriminative ability of the event representations; its superior performance is verified on the large-scale ActivityNet Captions dataset.

4. This thesis proposes a joint ranking method for dense video captioning. It quantitatively measures a necessary condition that a good dense captioning system should satisfy: both the localization and the caption must have high confidence. The thesis proposes confidence measurements for localization and captioning respectively, and selects proposal-sentence pairs with high combined scores via ranking (a scoring sketch closes the list below). This greatly improves the dense captioning system.
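To make contribution 1 concrete, the following is a minimal sketch of the kind of architecture it describes: a shared backbone with two middle-level heads supervised by decomposed adjective and noun labels, whose activations feed a sentiment classifier. The backbone, layer sizes, and class counts are illustrative assumptions, not the thesis's exact design.

    # Hypothetical sketch of contribution 1: middle-level semantic
    # representations supervised by decomposed adjectives and nouns.
    # All dimensions and the tiny backbone are assumptions.
    import torch
    import torch.nn as nn

    class ANPSentimentNet(nn.Module):
        def __init__(self, num_adjectives=100, num_nouns=200, num_sentiments=2):
            super().__init__()
            # Any CNN backbone producing a global feature vector would do here.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            # Two middle-level heads supervised by decomposed adjective/noun labels.
            self.adj_head = nn.Linear(64, num_adjectives)
            self.noun_head = nn.Linear(64, num_nouns)
            # Sentiment classifier on the concatenated middle-level activations.
            self.sentiment = nn.Linear(num_adjectives + num_nouns, num_sentiments)

        def forward(self, images):
            feat = self.backbone(images)
            adj_logits = self.adj_head(feat)
            noun_logits = self.noun_head(feat)
            mid = torch.cat([adj_logits, noun_logits], dim=1)
            return adj_logits, noun_logits, self.sentiment(mid)

    # Joint training would combine a sentiment loss with auxiliary
    # adjective/noun losses on the two middle-level heads.
    model = ANPSentimentNet()
    adj_l, noun_l, sent_l = model(torch.randn(4, 3, 224, 224))

Because the adjective and noun vocabularies are supervised separately, new adjective-noun pairs can be scored without pair-level labels, which is how the decomposition addresses the scalability challenge.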
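For contribution 2, a rough sketch of the bidirectional idea follows: one recurrent pass scores proposals using past context, a second pass over the reversed sequence scores them using future context, and the two confidence maps are fused. The GRU encoders, the multi-scale scoring heads, and the element-wise product fusion of time-aligned scores are simplifying assumptions, not the thesis's exact scheme.

    # Hypothetical sketch of contribution 2: fusing forward- and
    # backward-pass proposal confidences so every proposal sees both
    # past and future context.
    import torch
    import torch.nn as nn

    class BidirectionalProposals(nn.Module):
        def __init__(self, feat_dim=500, hidden=256, num_scales=32):
            super().__init__()
            self.fwd_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.bwd_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            # At each time step, score proposals at num_scales temporal scales.
            self.fwd_score = nn.Linear(hidden, num_scales)
            self.bwd_score = nn.Linear(hidden, num_scales)

        def forward(self, feats):                    # feats: (B, T, feat_dim)
            h_fwd, _ = self.fwd_rnn(feats)           # encodes past context
            h_bwd, _ = self.bwd_rnn(feats.flip(1))   # encodes future context
            c_fwd = torch.sigmoid(self.fwd_score(h_fwd))
            c_bwd = torch.sigmoid(self.bwd_score(h_bwd)).flip(1)  # realign time
            return c_fwd * c_bwd                     # fused confidences (B, T, S)

    scores = BidirectionalProposals()(torch.randn(2, 100, 500))

Because no sliding window is involved, proposal durations are limited only by the scale set, and the backward pass supplies the future context that a unidirectional single-stream method ignores.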
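The context-gating component of contribution 3 can be sketched as a learned sigmoid gate that balances, per feature and per decoding step, how much of the event representation versus its attentively fused context reaches the caption decoder. The gate form and dimensions below are assumptions.

    # Hypothetical sketch of contribution 3's context gating: a learned
    # gate weights event content against its surrounding context before
    # the caption decoder generates the next word.
    import torch
    import torch.nn as nn

    class ContextGate(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, event, context):
            # g in (0, 1) decides, per feature, how much context passes through.
            g = torch.sigmoid(self.gate(torch.cat([event, context], dim=-1)))
            return g * event + (1.0 - g) * context

    gate = ContextGate()
    event = torch.randn(4, 512)    # detected event representation
    context = torch.randn(4, 512)  # bidirectional attentively fused context
    fused = gate(event, context)   # input to the caption decoder at each step

Gating the context rather than simply concatenating it lets the model suppress the context of highly overlapping neighboring events, which is what makes the event representations more discriminative.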
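Finally, contribution 4's joint ranking can be sketched as combining a localization confidence with a captioning confidence and keeping the highest-scoring proposal-sentence pairs. The multiplicative combination and the length-normalized sentence probability used below are illustrative assumptions.

    # Hypothetical sketch of contribution 4: jointly ranking
    # proposal-sentence pairs by localization and captioning confidence.
    import math

    def caption_confidence(token_logprobs):
        """Length-normalized sentence probability as captioning confidence."""
        return math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))

    def rank_pairs(proposals, top_k=10):
        """proposals: list of (loc_conf, token_logprobs, sentence) tuples."""
        scored = [(loc * caption_confidence(lps), sent)
                  for loc, lps, sent in proposals]
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:top_k]

    pairs = [(0.9, [-0.2, -0.1, -0.4], "a man plays guitar"),
             (0.6, [-0.9, -1.2, -0.8], "someone is cooking")]
    print(rank_pairs(pairs, top_k=1))

A pair survives only when both factors are high, which operationalizes the necessary condition stated above: a well-localized event with a low-confidence caption, or vice versa, is ranked down.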
Keywords/Search Tags: Visual Understanding, Image Sentiment Recognition, Temporal Action Localization, Dense Video Captioning, Deep Learning