
Research On Key Issues Of Pedestrian Perception And Analysis In Surveillance Video

Posted on: 2020-03-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Q Jiang
Full Text: PDF
GTID: 1368330611455341
Subject: Detection Technology and Automation
Abstract/Summary:
Understanding and analyzing video content is one of the most active topics in computer vision, with broad application prospects in public security, autonomous driving, and human-computer interaction. This dissertation focuses on key issues of pedestrian perception and analysis in surveillance video, and analyzes the behavior of individuals and of crowds separately. For individual behavior, each pedestrian is separated from the video sequence by moving object detection and target tracking, and then classified by an action recognition model. For group behavior, the crowd density of video frames is estimated by a crowd counting model. Accordingly, this dissertation studies four aspects: moving object detection, object tracking, action recognition, and crowd counting. The main contents and innovations are summarized as follows:

(1) An adaptive weight-sample-based object detection model is proposed. Existing sample-based methods assume that every sample is equally important, so valid samples are easily overwritten by mistake when the model is updated, which lowers accuracy. To address this, variable weights are used to measure the importance of samples, and efficiency is used to evaluate their activity, so that the model can identify valid samples simply and effectively. A new update strategy is also proposed to adapt quickly to scene changes. First, a minimum-weight update strategy avoids incorrect updating of valid samples. Second, a reward-and-penalty weighting strategy strengthens the weights of positive samples and penalizes the others. Finally, a quantitative spatial-diffusion policy reduces the influence of ghosts and other noise. In addition, an adaptive feedback technique is introduced so that the algorithm can handle more challenging video sequences in the experiments. The final results show that the proposed method outperforms other state-of-the-art methods on the CDNet dataset.

(2) An object tracking model based on self-correlated representation is proposed. Sparse representation (SR), a seminal model for visual tracking, explores the relationship between all candidates and the observed templates. This representation is unidirectional, so the model cannot detect or reduce the effect of noise once noisy samples have been updated into the template. To address this, an object tracking model based on self-correlated representation is proposed, designed to reduce both internal noise and external disturbances. First, a low-dimensional subspace representation is learned from highly correlated templates to model the object, which eliminates redundant information and reduces the influence of noisy templates. The subspace is then represented by itself in order to learn the underlying features of the subspace vectors. To further enhance the model's discriminative power, a new observation model is developed that considers both the error distribution and large outliers. Experimental results show that the proposed method tracks effectively on several challenging video sequences.

(3) An action recognition method based on a dual 3D convolutional network is proposed. The high computational cost and memory demand of standard 3D CNNs hinder their application in practical scenarios. This work addresses these limitations with a novel dual 3D convolutional network consisting of a coarse branch and a fine branch. The coarse branch maintains a large temporal receptive field through a fast temporal downsampling strategy, and approximates the expensive 3D convolutions with a combination of more efficient spatial and temporal convolutions. The fine branch progressively downsamples the video in the temporal domain and adopts 3D convolutional units with reduced channel capacity to capture multi-resolution spatio-temporal information. The key idea of the design is to avoid using the computationally expensive 3D convolutional network for all spatio-temporal patterns, and to use it only for patterns that require fine-grained spatio-temporal discrimination. Other patterns are expected to be handled by the coarse branch, so the capacity demand of the 3D convolutional subnet can be reduced. Instead of learning the two branches independently, a shallow spatio-temporal downsampling module is shared by both branches for efficient low-level feature learning. In addition, lateral connections are learned to fuse the information from the two branches at multiple stages. Trained from scratch, the proposed network achieves competitive performance on three challenging video datasets, with a network inference speed of 4559 FPS on a single NVIDIA GTX 1080 Ti.

(4) A crowd density estimation model based on a mask-aware convolutional neural network is proposed. Most regression-based models regress the Gaussian density map directly, which usually increases the learning cost of the network and reduces the accuracy of its prediction. In practice, the crowd counting problem is usually solved by estimating a density map generated from object location annotations. The values in a density map naturally take one of two states: zero, indicating that no object is nearby, or a non-zero value, indicating the presence of objects. We therefore propose to use a dedicated network branch to predict the object/non-object mask, and then to combine its prediction with the input image to produce the density map. The rationale is that mask prediction is better modeled as a binary segmentation problem, and that estimating the density becomes easier once the mask is known. A key element of the proposed scheme is the strategy for incorporating the mask prediction into the density map estimator. We therefore study five possible solutions and identify the most effective one through analysis and experimental validation. Finally, we demonstrate the effectiveness of the proposed method and show that the network achieves state-of-the-art performance on three public datasets.
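
The weighting ideas in contribution (1) can be illustrated with a minimal single-pixel sketch. This is not the dissertation's implementation: the function name `update_pixel_model`, the parameters `radius`, `threshold`, `reward`, and `penalty`, their default values, the weight given to a fresh sample, and the choice to update only on a background classification are all assumptions made for illustration. The efficiency measure, spatial diffusion, and adaptive feedback are omitted.

```python
import numpy as np


def update_pixel_model(samples, weights, value, radius=20.0, threshold=2.0,
                       reward=1.0, penalty=0.5):
    """Classify one grayscale pixel value and update its weighted sample model.

    samples: 1-D float array of stored background intensities (edited in place)
    weights: 1-D float array, one importance weight per sample (edited in place)
    Returns True if the value is classified as background.
    All parameter names and defaults are hypothetical.
    """
    # A stored sample "matches" when it is close to the observed value.
    matched = np.abs(samples - value) <= radius

    # Weighted vote: matched samples count by importance, not by number.
    is_background = bool(weights[matched].sum() >= threshold)

    if is_background:
        # Reward-and-penalty: strengthen samples that agree with the
        # observation, weaken the others (clamped at zero).
        weights[matched] += reward
        weights[~matched] = np.maximum(weights[~matched] - penalty, 0.0)

        # Minimum-weight update: overwrite the least important sample, so a
        # high-weight (valid) sample is not replaced by accident.
        idx = int(np.argmin(weights))
        samples[idx] = value
        weights[idx] = 1.0  # assumed starting weight for a fresh sample
    return is_background


# Usage: three samples near 100 outvote one outlier at 200.
samples = np.array([100.0, 102.0, 98.0, 200.0])
weights = np.ones(4)
print(update_pixel_model(samples, weights, 101.0))  # True
print(samples)  # the outlier at index 3 has been replaced by 101.0
```

Replacing the minimum-weight sample rather than a random one is what keeps frequently matched samples in the model, which is the failure mode of equal-importance sample-based methods that the abstract describes.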
Keywords/Search Tags: moving object detection, object tracking, action recognition, crowd counting