
Research On Spatial-Temporal Feature Extraction

Posted on: 2017-05-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Miao
Full Text: PDF
GTID: 1108330503485224
Subject: Information and Communication Engineering
Abstract/Summary:
Video content recognition is an essential problem in computer vision, with applications in intelligent video surveillance, human-computer interaction, video indexing, and more. Feature representation is central to video recognition. Because videos are complex, feature representations are affected by viewpoint, background, timing, and other factors, so extracting robust video representations is hard. In recent years much effort has been devoted to video feature representation, but wide deployment of video recognition remains impractical. Conventional approaches rely on hand-crafted descriptors for local feature extraction, which are insufficient for robust video representation; they also involve heavy computation and cannot support real-time applications. This dissertation focuses on spatial-temporal feature extraction for video recognition and improves both accuracy and speed. The main contributions of this dissertation are as follows.

1. Slow feature analysis (SFA) extracts slowly varying features from input signals and has been used to model complex cells in the primary visual cortex (V1). V1 transmits information to both the ventral and dorsal pathways, which process appearance and motion information, respectively. However, SFA uses only slowly varying features for local feature extraction, and these represent appearance more effectively than motion. To better exploit temporal information, we propose temporal variance analysis (TVA) as a generalization of SFA. TVA learns a linear transformation matrix that projects multidimensional temporal data onto temporal components ordered by temporal variance. Inspired by the function of V1, we learn receptive fields by TVA and apply convolution and pooling to extract local features. We evaluate the proposed TVA features on several challenging datasets and show that both slow and fast features are useful for low-level feature extraction.
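As a rough illustration of the linear SFA/TVA idea described above, the sketch below solves the standard generalized eigenproblem between the covariance of a signal and the covariance of its finite-difference temporal derivative, then keeps components from both ends of the eigenspectrum (slow and fast). All names, the finite-difference approximation, and the regularization constant are our own assumptions, not the dissertation's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def temporal_variance_components(X, k_slow=1, k_fast=1):
    """Project T x D temporal data onto directions of extreme temporal variance.

    Slow components (smallest generalized eigenvalues) tend to capture
    appearance-like structure; fast components (largest) capture motion-like
    structure. A sketch of linear SFA generalized, as TVA proposes, to use
    both ends of the eigenspectrum.
    """
    X = X - X.mean(axis=0)               # center the signal
    dX = np.diff(X, axis=0)              # finite-difference temporal derivative
    B = X.T @ X / len(X)                 # covariance of the signal
    A = dX.T @ dX / len(dX)              # covariance of its derivative
    # Generalized eigenproblem A w = lambda B w; eigh sorts eigenvalues ascending.
    lams, W = eigh(A, B + 1e-8 * np.eye(B.shape[1]))
    slow = W[:, :k_slow]                 # slowly varying directions
    fast = W[:, -k_fast:]                # rapidly varying directions
    return X @ slow, X @ fast, lams

# Toy demo: a slow and a fast sinusoid, linearly mixed, plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
S = np.stack([np.sin(0.5 * t), np.sin(20 * t)], axis=1)
M = rng.standard_normal((2, 2))          # random mixing matrix
X = S @ M + 0.001 * rng.standard_normal((500, 2))
slow_y, fast_y, lams = temporal_variance_components(X, k_slow=1, k_fast=1)
```

On this toy mixture, the slowest component recovers the slow sinusoid and the fastest recovers the fast one (up to sign and scale), which is the separation of appearance-like and motion-like structure the paragraph describes.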
Experimental results show that the proposed TVA features outperform conventional histogram-based features.

2. Dynamic textures appear in many forms, e.g., fire, smoke, and traffic jams, but recognizing them is challenging because of their complex temporal variations. We present a novel approach, derived from slow feature analysis (SFA), for dynamic texture recognition. SFA can learn invariant representations from dynamic textures; however, complex temporal variations require high-level semantic representations to fully achieve temporal slowness, so learning a high-level representation directly from dynamic textures with SFA is impractical. To learn a robust low-level feature that copes with the complexity of dynamic textures, we propose manifold-regularized SFA (MR-SFA), which explores the neighbor relationships among the initial states of temporal transitions and preserves the locality of their variations. The learned features are therefore not only slowly varying but also partly predictable. Experimental results on dynamic texture and dynamic scene recognition datasets validate the effectiveness of the proposed approach.

3. Traditional video recognition approaches are too slow for real-time or large-scale applications. This problem has been tackled by replacing optical flow with motion vectors from the compressed domain, yet further use of compressed-domain information for video recognition is possible. In video compression, discrete cosine transform (DCT) coefficients, which correspond to residue data, carry information that block-based motion vectors fail to capture. We propose a set of residue boundary histogram (RBH) features that exploit different parts of the DCT coefficients. We also propose an efficient feature extraction scheme based on compressed depth maps, in which each depth map is coded by breakpoints and an adaptive discrete wavelet transform (DWT).
DWT coefficients describe smooth variations in depth, while breakpoints capture sharp boundaries; both are used to construct features for video representation. Experimental results on action recognition datasets show that the proposed scheme is computationally more efficient than conventional approaches while achieving competitive recognition accuracy.

Overall, this dissertation proposes novel spatial-temporal feature extraction approaches for better recognition accuracy on the one hand, and several compressed-domain features for better computational efficiency on the other.
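To make the residue-boundary-histogram idea of contribution 3 concrete, here is a hypothetical simplification: per 8x8 block of residue data, the two lowest AC block-DCT coefficients give a rough dominant gradient direction, which can be accumulated into an energy-weighted orientation histogram. The function name, the use of only `F[0,1]` and `F[1,0]`, and the bin layout are our illustrative choices; the dissertation's actual RBH construction may use different parts of the DCT coefficients.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix (the transform used for block residues).
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def residue_boundary_histogram(residue, bins=8):
    """Hypothetical RBH-style sketch: per 8x8 block, estimate a dominant
    gradient orientation from the two lowest AC DCT coefficients and build
    an energy-weighted orientation histogram over the whole residue frame."""
    C = dct_matrix(8)
    h, w = residue.shape
    hist = np.zeros(bins)
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            F = C @ residue[y:y+8, x:x+8] @ C.T   # block DCT coefficients
            gx, gy = F[0, 1], F[1, 0]             # horizontal / vertical AC terms
            mag = np.hypot(gx, gy)                # boundary energy in the block
            ang = np.arctan2(gy, gx) % np.pi      # orientation folded to [0, pi)
            hist[int(ang / np.pi * bins) % bins] += mag
    s = hist.sum()
    return hist / s if s > 0 else hist
```

For example, a residue block containing a vertical step edge puts all of its mass into the horizontal-gradient bin, while a horizontal edge lands in the bin at pi/2; no pixel-domain decoding is needed beyond the DCT coefficients themselves, which is the source of the scheme's efficiency.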
Keywords/Search Tags: action recognition, dynamic texture recognition, local feature extraction, slow feature analysis, compressed domain