Deep Progressive Learning For Fine-Grained Visual Understanding

Posted on: 2020-04-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y C Yan
GTID: 1368330623463983
Subject: Information and Communication Engineering
Abstract:
In recent years, with the large-scale construction of video surveillance networks and the rapid spread of mobile devices, image and video data have grown explosively. For lack of analysis techniques, a large number of images and videos become useless data that "sleep" in storage systems. To make massive image and video collections useful, automatic techniques for analyzing and understanding visual data are urgently needed. Deep learning has greatly advanced computer vision since 2012 and provides an important research approach for analyzing visual data. However, existing methods achieve satisfactory results only on generic visual understanding tasks, which greatly limits the further development and application of computer vision. This dissertation proposes the idea of progressive learning to address these issues, focusing on how to achieve a high-level, fine-grained understanding of visual data. In terms of tasks, it covers fine-grained visual understanding at three levels: instance-level understanding, fine-grained categorization, and pixel-level understanding. In terms of method, progressive learning divides the detailed semantic information into several progressive stages. Based on this idea, a series of progressive models is proposed to adapt to different levels of visual tasks. The main contributions and innovations of this dissertation are as follows:

(1) Deep progressive learning framework: This dissertation proposes a novel deep learning framework, namely deep progressive learning. Because a traditional single-stage deep learning framework cannot effectively model the detailed features of objects, the proposed framework decomposes the task into multiple progressive stages and models the detailed semantic information at each stage. The dissertation defines three important characteristics of deep progressive learning: configurability, scalability, and superiority. Based on these three characteristics, it designs a complete deep progressive learning framework and applies it to a series of tasks. The results show that the framework can be applied to a wide variety of computer vision tasks and outperforms traditional deep models.

(2) Temporal progressive learning for person re-identification: For a pedestrian appearing in a video, a single frame contains only limited identity information. How to effectively extract and fuse the discriminative information in different video frames is the key problem in the person re-identification task. Existing methods often fail to make use of the effective information in temporal video sequences. This dissertation proposes a progressive learning model from the perspective of temporal information fusion. To integrate temporal features effectively, we propose a recurrent feature fusion model (RFA-Net) based on long short-term memory networks. At each time step, the feature fusion network accepts the pedestrian features as input and gradually aggregates the useful features into a highly discriminative sequence representation. We validate RFA-Net on three public datasets, and the results show that it achieves better results than traditional fusion methods based on both hand-crafted features and CNN features.

(3) Spatial progressive learning for fine-grained image recognition: The major challenge of fine-grained image recognition is the large intra-class variation combined with high inter-class similarity. How to effectively discover and compare the details of fine-grained categories is the key challenge of this task. The most typical solution is to use a single CNN to classify the input image. However, such models cannot effectively extract detailed object features. To address detail feature mining and feature fusion, we propose a progressive learning approach. We use a recurrent attention network to sequentially locate different object part regions, and a long short-term memory network (LSTM) to fuse the features of the different parts into more discriminative image-level features.

(4) Interactive progressive learning for human interaction prediction: Human interaction prediction is a challenging fine-grained understanding task. Its core challenges are how to effectively model the interaction between individuals and how to find the most discriminative areas in the scene for distinguishing fine-grained action categories. To solve these two problems, this dissertation proposes a progressive model to predict the interaction. First, we propose a coupled model that effectively accounts for the interaction between the interacting individuals. Second, we propose a relative attention model, which tries to find the most discriminative spatial region for motion recognition, further improving the performance of the model. The proposed model can be seamlessly embedded into the classic action recognition framework for end-to-end learning.

(5) Modality progressive learning for video generation: Video generation is a more challenging visual understanding task because it requires pixel-level understanding of visual data. The challenge is how to find a reasonable data distribution in a huge solution space. Traditional video generation algorithms lack effective constraints on the structure of the foreground objects, which results in severe deformation and blur in the generated videos. In this dissertation, we propose a structure progressive model for video generation. The proposed model consists of two progressive steps. First, we generate a sequence of object key points according to a predefined motion pattern. Then we use the key point sequence as a structural constraint to generate different poses of the object, which are combined into a video. Experimental results show that the proposed algorithm can generate realistic video sequences.

Overall, this dissertation proposes a deep progressive learning framework and applies it to different fine-grained visual understanding tasks. For instance-level person re-identification, we propose a temporal progressive learning model. For fine-grained image and video understanding tasks, we propose a spatial progressive model and an interactive progressive model, respectively. For pixel-level video generation, we propose a modality progressive model. A large number of experimental results and extensive theoretical analysis show that the proposed progressive learning method is superior at different levels of fine-grained visual understanding.
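
The temporal fusion described in contribution (2) can be illustrated with a small sketch. This is not the dissertation's RFA-Net implementation: the names `TinyLSTMCell` and `aggregate_sequence` are hypothetical, the weights are random and untrained, and averaging the hidden states over time is an assumed pooling choice that the abstract does not specify. The sketch only shows the general pattern of feeding per-frame features into an LSTM one step at a time and collapsing the outputs into a single sequence-level representation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyLSTMCell:
    """A single LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim
        # One weight matrix and bias per gate:
        # i = input gate, f = forget gate, o = output gate, g = candidate.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_dim)]
            for gate in "ifog"
        }
        self.b = {gate: [0.0] * hidden_dim for gate in "ifog"}

    def step(self, x, h, c):
        """Consume one frame feature x and update the state (h, c)."""
        z = x + h  # concatenate the input with the previous hidden state
        pre = {
            gate: [a + b for a, b in zip(matvec(self.W[gate], z), self.b[gate])]
            for gate in "ifog"
        }
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["g"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def aggregate_sequence(frame_features, cell):
    """Fuse per-frame features into one vector by mean-pooling LSTM outputs."""
    if not frame_features:
        raise ValueError("need at least one frame feature")
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    hidden_states = []
    for x in frame_features:
        h, c = cell.step(x, h, c)
        hidden_states.append(h)
    steps = len(hidden_states)
    return [sum(hs[k] for hs in hidden_states) / steps
            for k in range(cell.hidden_dim)]


# Three made-up 4-dimensional frame features for one pedestrian track.
frames = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5], [0.3, 0.3, 0.1, 0.0]]
cell = TinyLSTMCell(input_dim=4, hidden_dim=3)
sequence_feature = aggregate_sequence(frames, cell)
print(len(sequence_feature))  # one fixed-length vector per sequence
```

In a real system the frame features would come from a hand-crafted or CNN feature extractor and the cell would be trained with an identity loss; both are outside the scope of this sketch.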
Keywords: progressive learning, long short-term memory network, convolutional neural network, feature fusion, end-to-end learning, attention model, generative model