
Structured Processing Of High-Dimensional Visual Signals

Posted on: 2017-11-12    Degree: Doctor    Type: Dissertation
Country: China    Candidate: B T Wang    Full Text: PDF
GTID: 1368330590490831    Subject: Information and Communication Engineering
Abstract/Summary:
Visual signals, including images, videos and light fields, exhibit more complicated variations and more abstract meanings than conventional low-dimensional signals such as radar and acoustic signals, owing to their high dimensionality and the influence of illumination, background clutter, scaling and non-rigid deformation. Focusing on the problem of recognizing high-dimensional visual signals, this dissertation decomposes complicated targets into simple and stable visual elements for structured representation, and explores the intrinsic relations and joint modeling of these visual elements via structured prediction. From low-level to high-level vision, the dissertation represents four types of visual signals, namely exemplars, objects, scenes and captioned images, in terms of poses, parts, semantic elements and concepts in a structured manner, and performs structured prediction via transductive learning, kernel methods, multi-task learning and probabilistic graphical models for video segmentation, object recognition, scene classification and image understanding.

First, this dissertation proposes a multi-component transductive video segmentation algorithm to segment a pre-defined object of interest across the frames of a video sequence. Inspired by image co-segmentation, the proposed method jointly segments multiple frames by simultaneously maximizing the inter-frame foreground similarity and minimizing the intra-frame foreground/background divergence. Specifically, segmentation proposals are drawn from the segmentation space by constrained parametric min-cut, and an energy function is elaborately designed to evaluate each proposal in terms of foreground resemblance, foreground/background divergence, boundary strength and visual saliency. The optimal foreground mask is then obtained by assembling the segmentation proposals via Monte Carlo approximation. In addition, a multi-component foreground model is developed to capture the variation of the foreground object across frames, which is particularly effective for videos that exhibit significant changes in visual appearance. To group the frames into different components, a tree-structured model named the temporal tree is designed. The temporal tree organizes visually similar and temporally coherent frames in the same branch through probabilistic clustering, and learns the component models by transductive learning, which has stronger generalization capability than inductive approaches.

Second, this dissertation proposes an object recognition method based on a multi-scale part-based model and a structure kernel. The multi-scale part-based model improves the deformable part-based model with a multi-scale part representation to capture changes of part scale caused by poses, viewpoints, intra-class diversity, etc. The model represents an object instance with a global visual feature and a set of part visual features. Moreover, the spatial configuration of the parts is depicted in a three-dimensional space comprising the two-dimensional planar coordinates and the scale. Furthermore, a structure kernel is developed that combines the discriminative capability of local kernels with the flexibility of part-based models to improve object recognition performance. Based on the observation that objects of the same class should share similar global appearance, distinctive parts and a particular spatial configuration, the structure kernel measures the similarity of two part-based object representations in terms of global visual similarity, the visual similarity of the parts and the spatial similarity of the parts. Notably, the structure kernel is flexible in its parameter configuration and can be learned in a data-driven manner to fit different object classes more accurately.

Third, this dissertation proposes multi-task semantic codebook learning and a context-aware image representation for scene classification. The proposed method encodes the local features of a semantic class with a distinct semantic codebook, which captures the color, shape and texture of the semantic class more accurately. Instead of learning each semantic codebook separately, the proposed algorithm learns a compact global codebook, of which each semantic codebook is a sparse subset. On the one hand, a codeword can be shared by many semantic classes, which reflects the intrinsic relations among these classes. On the other hand, a semantic class may have some unique codewords, which reflect the distinctiveness of the class. To learn the semantic codebooks, a multi-task codebook learning algorithm is developed that iteratively optimizes the global codebook via convex optimization and the assignment of semantic codewords via submodular optimization. Based on the learned global and semantic codebooks, a context-aware image representation is conceived to model the visual features of scenes through contextual quantization, semantic response computation and semantic pooling.

Finally, a holistic image understanding framework is presented for captioned images by learning the cross-domain relations among texts, scenes and objects. Specifically, the relation between texts and objects is represented by the matching probability of nouns and object classes, which is obtained by solving an instance-level constrained bilateral matching problem. In addition, the relations between objects/texts and scenes are represented by the frequency of occurrence of objects/texts in particular scenes. The proposed method leverages image-level annotations, including scene labels and the cardinalities of the object classes, to learn these cross-domain relations. Taking advantage of the cross-domain relations together with off-the-shelf object detectors and scene classifiers, a holistic image understanding model is presented that jointly reasons about the scene class of the image, the objects in the image, the cardinalities of the objects and the locations of the object instances. Specifically, a conditional random field model is established to formulate the joint probability of texts, objects and scenes.
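The proposal-scoring step of the video segmentation method can be illustrated with a minimal sketch. The cue functions, weights, and the argmin selection below are illustrative assumptions standing in for the dissertation's actual energy terms, constrained parametric min-cut proposals, and Monte Carlo assembly.

```python
import numpy as np

def histogram(pixels, bins=8):
    """Normalized intensity histogram used as a toy appearance model."""
    h, _ = np.histogram(pixels, bins=bins, range=(0.0, 1.0))
    h = h.astype(float)
    return h / max(h.sum(), 1e-12)

def proposal_energy(frame, mask, fg_model, weights=(1.0, 1.0, 0.5, 0.5)):
    """Lower energy = better proposal. `fg_model` is a reference foreground
    histogram shared across frames (the inter-frame resemblance cue)."""
    fg, bg = frame[mask], frame[~mask]
    if fg.size == 0 or bg.size == 0:
        return np.inf
    h_fg, h_bg = histogram(fg), histogram(bg)
    resemblance = np.abs(h_fg - fg_model).sum()    # match the shared fg model
    divergence = -np.abs(h_fg - h_bg).sum()        # reward fg/bg contrast
    boundary = mask.mean()                         # crude area regularizer
    saliency = -fg.mean()                          # brighter = salient (toy cue)
    w = weights
    return w[0]*resemblance + w[1]*divergence + w[2]*boundary + w[3]*saliency

def best_proposal(frame, proposals, fg_model):
    """Pick the minimum-energy proposal; the dissertation instead assembles
    proposals by Monte Carlo approximation, which this greedy choice simplifies."""
    energies = [proposal_energy(frame, m, fg_model) for m in proposals]
    return int(np.argmin(energies))
```

A proposal whose foreground matches the shared model and contrasts with its background receives low energy, which is the joint criterion the method optimizes across frames.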
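The structure kernel's three similarity terms can likewise be sketched. The RBF base kernels, the additive combination, and the assumption that parts are pre-aligned by index are simplifications; the dissertation's learned, data-driven parameterization is not reproduced here.

```python
import numpy as np

def rbf(a, b, gamma):
    """Radial basis function kernel between two vectors."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-gamma * np.dot(d, d)))

def structure_kernel(obj_a, obj_b, gamma_vis=0.5, gamma_spa=0.5, alpha=1.0):
    """Each object is {'global': vec, 'parts': [(feat, (x, y, scale)), ...]}.
    Similarity = global appearance term + per-part visual similarity
    modulated by spatial similarity in (x, y, scale) space."""
    k = alpha * rbf(obj_a['global'], obj_b['global'], gamma_vis)
    for (fa, pa), (fb, pb) in zip(obj_a['parts'], obj_b['parts']):
        k += rbf(fa, fb, gamma_vis) * rbf(pa, pb, gamma_spa)
    return k
```

Multiplying part-visual by part-spatial similarity means a part only contributes when it both looks alike and sits at a compatible position and scale, mirroring the spatial-configuration constraint described above.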
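The codebook-sharing idea admits a small sketch as well: a global codebook is shared, and each semantic class retains only the codewords that explain its features. Selecting the top-k most-used codewords per class is a toy stand-in for the submodular assignment step; the convex global-codebook update is omitted.

```python
import numpy as np

def quantize(features, codebook):
    """Index of the nearest codeword for each feature (hard assignment)."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def semantic_subsets(class_features, codebook, k):
    """For each class, keep the k most-used codewords of the global codebook,
    so classes share common codewords yet retain distinctive ones."""
    subsets = {}
    for name, feats in class_features.items():
        counts = np.bincount(quantize(feats, codebook), minlength=len(codebook))
        subsets[name] = set(np.argsort(-counts)[:k].tolist())
    return subsets
```

Codewords used by several classes capture their intrinsic relations, while codewords kept by only one class capture its distinctiveness, as the abstract notes.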
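Finally, the joint reasoning over texts, objects and scenes can be sketched as MAP inference in a tiny unnormalized CRF. The scene and object vocabularies, the compatibility table, and exhaustive enumeration are toy assumptions; the dissertation's learned cross-domain relations and inference procedure are not reproduced.

```python
import itertools

SCENES = ['street', 'kitchen']
OBJECTS = ['car', 'oven']

# Toy pairwise potentials: how well an object class fits a scene class.
SCENE_OBJECT = {('street', 'car'): 2.0, ('street', 'oven'): -1.0,
                ('kitchen', 'car'): -1.0, ('kitchen', 'oven'): 2.0}

def joint_score(scene, present, detector_scores, nouns):
    """Unnormalized CRF score: detector unaries + scene-object compatibility
    + a text unary rewarding objects whose class noun appears in the caption."""
    s = 0.0
    for obj in OBJECTS:
        if present[obj]:
            s += detector_scores[obj]          # unary: detector confidence
            s += SCENE_OBJECT[(scene, obj)]    # pairwise: scene-object relation
            s += 1.0 if obj in nouns else 0.0  # text-object matching evidence
    return s

def map_inference(detector_scores, nouns):
    """Exhaustive MAP over scenes and object presences (fine at toy scale)."""
    best = None
    for scene in SCENES:
        for bits in itertools.product([False, True], repeat=len(OBJECTS)):
            present = dict(zip(OBJECTS, bits))
            sc = joint_score(scene, present, detector_scores, nouns)
            if best is None or sc > best[0]:
                best = (sc, scene, present)
    return best
```

Because the scene, object and text variables are scored jointly, a caption mentioning "car" can both promote the car detection and pull the scene label toward "street", which is the holistic behavior the framework targets.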
Keywords/Search Tags:Video segmentation, image classification, transductive learning, kernel methods, multi-task learning, graphical model