
Research on Video Event Recognition Using Deep Network Spatio-Temporal Consistency

Posted on: 2020-01-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y G Li    Full Text: PDF
GTID: 1488305777498144    Subject: Computer Science and Technology
Abstract/Summary:
Video event recognition aims to recognize the spatio-temporal visual patterns of events in videos. It has important application prospects in many fields, such as intelligent surveillance, medical care, and robot vision, and it is a hot research topic in computer vision. Video data are characterized by large volume, complex sequences, low resolution, severe occlusion between moving objects, and tremendous intra-class variation, which makes video event recognition a very challenging task. In recent years, deep learning has vigorously promoted the development of computer vision, and visual representations based on deep learning have achieved remarkable performance in video event recognition. Since video contains abundant spatial and temporal information, joint spatio-temporal modeling is an important foundation of video analysis. From the viewpoint of deep network spatio-temporal consistency, this dissertation discusses several significant problems in video event recognition: the inconsistency of spatio-temporal representations, the insufficient ability of networks to learn global video features under complex backgrounds or complex video sequences, and the deficiency in capturing event details under severe occlusion. The main contributions are summarized as follows:

(1) To address the inconsistency of deep spatio-temporal representations in video event recognition with complex backgrounds, a video event recognition method using Convolutional Neural Network (CNN) spatio-temporal feature-map consistency is proposed, operating at two levels: local and global spatio-temporal consistency. By visualizing the convolutional layers of two-stream networks and observing the evolution of the convolutional features, a peer-to-peer pooling algorithm between spatial CNN feature maps and their temporal counterparts, called Maximal Region Growing Pooling (MRGP), is devised, yielding a local spatio-temporal consistency layer (a minimal sketch of the pooling idea is given below). Furthermore, trajectory-constrained pooling is applied to the deep features to combine the merits of deep and hand-crafted features. The two-stream branches and the spatio-temporal consistency layer together form a triple-channel model, which yields the final recognition result through global spatio-temporal consistency fusion. Experiments on two benchmark surveillance video datasets, VIRAT 1.0 and VIRAT 2.0, show that the proposed method outperforms state-of-the-art methods on these event benchmarks.
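The abstract does not specify MRGP's exact region-growing rule. The following PyTorch sketch only illustrates the peer-to-peer pooling idea: the motion stream's feature map selects a salient region, and the appearance stream is pooled over that same region. The function name `mrgp_pool`, the threshold-based "growing" rule, and the `grow_ratio` parameter are illustrative assumptions, not the dissertation's implementation.

```python
import torch

def mrgp_pool(spatial_map: torch.Tensor, temporal_map: torch.Tensor,
              grow_ratio: float = 0.5) -> torch.Tensor:
    """Peer-to-peer pooling between a spatial CNN feature map and its
    temporal counterpart (sketch of the MRGP idea, not the original).

    spatial_map, temporal_map: (C, H, W) conv feature maps of the two streams.
    grow_ratio: fraction of the maximal temporal activation used as the
                threshold when "growing" the region (assumed rule).
    """
    # Collapse the temporal channels into one saliency map.
    saliency = temporal_map.abs().sum(dim=0)                  # (H, W)
    # Grow a region around the maximal activation by thresholding.
    mask = (saliency >= grow_ratio * saliency.max()).float()  # (H, W)
    # Average the spatial features over the grown region only, so the
    # appearance descriptor is taken where the motion response is strongest.
    pooled = (spatial_map * mask.unsqueeze(0)).sum(dim=(1, 2))
    return pooled / mask.sum().clamp(min=1.0)                 # (C,)

# Usage: pool the last conv maps of a two-stream network.
descriptor = mrgp_pool(torch.randn(512, 14, 14), torch.randn(512, 14, 14))
```

Under these assumptions, the appearance stream is pooled exactly where the motion stream responds, which is the local spatio-temporal consistency the layer is meant to enforce.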
(2) To address the insufficient ability of networks to learn global video features in complex scenes or complex video sequences, a video event recognition method based on deep residual recurrent network spatio-temporal consistency is proposed, focusing on the design of the residual model and the optimization function. To construct the residual model, a spatio-temporal feature concatenation layer is designed first, in which the deep features are transformed by a spatial LSTM and a temporal LSTM respectively and conjugated as a unit to form a spatio-temporally consistent input structure. Multiple such layers are then added to an identity mapping to build a residual block, and stacked residual blocks constitute the deep spatio-temporal holistic feature descriptor, termed Deep Residual Dual Unidirectional Double-LSTM (DRDU-DLSTM), which improves the network's ability to learn global features of video events. To further optimize the recognition results, a 2C-softmax objective function is devised based on a two-center loss, which minimizes intra-class variations while keeping the features of different classes separable. Experiments on the VIRAT 1.0 Ground Dataset and the VIRAT 2.0 Ground Dataset demonstrate that the proposed method performs well and is stable, achieving superior performance.

(3) To address the deficiency in capturing event details under severe occlusion, a spatio-temporal deep residual network model with hierarchical attention is built for video event recognition, combining an intra-frame attention mechanism with inter-frame long- and short-term dependency modeling. For the intra-frame attention, a three-layer hierarchical attention model is proposed (see the sketch after this paragraph). In the first layer, object-based attention (O-attention), guided by visual semantics, attends to the objects in the event region. In the second layer, holistic attention (H-attention), guided by overall perspective semantics and by the O-attention, perceives more details of occluded objects and of the global background. In the third layer, the two attention-enhanced features, representing global and local information, are fused and fed into a deep recursive network. To extract event information from the recursive network, two strategies are devised: one acquires the inter-frame long- and short-term dependencies of appearance information, and the other describes the long- and short-term features of motion information; together they construct a spatio-temporal architecture. Experiments on three video datasets, CCV, VIRAT 1.0, and VIRAT 2.0, demonstrate that the proposed method captures event details well under severe occlusion and achieves state-of-the-art performance.
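The hierarchical attention of method (3) can likewise be sketched with simple linear scoring functions. The module below is a minimal PyTorch reading of the three layers, assuming the O-attention output guides the H-attention through concatenation; the class name, the layer shapes, and the conditioning scheme are assumptions rather than the dissertation's design.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Sketch of the three-layer hierarchical attention: O-attention over
    object regions, H-attention over the whole frame guided by O-attention,
    then fusion of the two attention-enhanced features (assumed reading)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.o_score = nn.Linear(feat_dim, 1)          # object-level scores
        self.h_score = nn.Linear(2 * feat_dim, 1)      # holistic scores
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)  # fusion layer

    def forward(self, obj_feats: torch.Tensor, frame_feats: torch.Tensor):
        # obj_feats:   (B, No, D) features of objects in the event region
        # frame_feats: (B, Nf, D) grid-cell features of the whole frame
        # Layer 1: O-attention attends to the objects in the event region.
        a_o = torch.softmax(self.o_score(obj_feats), dim=1)
        f_o = (a_o * obj_feats).sum(dim=1)                          # (B, D)
        # Layer 2: H-attention over the whole frame, guided by f_o, to
        # recover details of occluded objects and the global background.
        ctx = f_o.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        a_h = torch.softmax(
            self.h_score(torch.cat([frame_feats, ctx], dim=-1)), dim=1)
        f_h = (a_h * frame_feats).sum(dim=1)                        # (B, D)
        # Layer 3: fuse local (object) and global (holistic) features.
        return self.fuse(torch.cat([f_o, f_h], dim=-1))             # (B, D)

# The fused per-frame feature then feeds a recurrent network (an LSTM here)
# standing in for the deep recursive network that models the inter-frame
# long- and short-term dependencies.
att = HierarchicalAttention(feat_dim=256)
rnn = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
frames = [att(torch.randn(2, 5, 256), torch.randn(2, 49, 256)) for _ in range(16)]
out, _ = rnn(torch.stack(frames, dim=1))   # (B, T, D) sequence features
```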
Keywords/Search Tags:Video event recognition, Spatio-temporal consistency, Convolutional Neural Networks(CNN), Recurrent neural networks, Hierarchical attention