
Exploiting Spatio-Temporal Relationships For Video Action Recognition And Detection

Posted on: 2022-05-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: D Li    Full Text: PDF
GTID: 1488306323462844    Subject: Information and Communication Engineering
Abstract/Summary:
Today's digital content is inherently multimedia: text, audio, image, video and so on. Video in particular has become a new medium of communication between Internet users with the proliferation of sensor-rich mobile devices. This trend has encouraged the development of advanced techniques for a broad range of video-understanding applications. A fundamental issue that underlies the success of these technological advances is action recognition and detection. Nevertheless, the task is not trivial due to the large variations and complexities of video content. Most existing approaches focus on extracting action representations from whole scenes with 2D or 3D CNNs. However, unlike objects, which can be identified solely by their visual appearance, in many cases actions cannot be identified by the visual appearance of actors or scenes alone. Rather, action understanding often requires reasoning about the actor's relationships with objects and other actors, both spatially and temporally.

To achieve this goal, we study how to learn actor-centric spatio-temporal relationships for action recognition and detection in videos. This thesis starts from the classic neural network architectures for videos (e.g., 2D and 3D CNNs) and studies how to devise and integrate novel structures, including an attention neural cell, a recurrent tubelet proposal module, a long short-term relation framework, and a multi-scale sub-graph learning framework, to equip the network with a strong ability to model spatio-temporal relationships in videos. In summary, this thesis makes the following contributions:

(1) This thesis devises a general attention neural cell, called AttCell, that estimates the most distinctive spatial regions in each video segment. With AttCell, a unified Spatial Attention Networks (SAN) architecture is proposed in the context of multiple modalities. Specifically, SAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors with the spatial attention measured by AttCell into a representation of each segment. Then, we concatenate the representations of the modalities to seek a consensus on the temporal attention, a priori, which holistically fuses the combined segment representations into a video representation for recognition. Extensive experiments are conducted on four public datasets, UCF101, CCV, THUMOS14 and Sports-1M; our SAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
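To make the spatial-attention pooling behind AttCell concrete, the following is a minimal PyTorch-style sketch of the general idea: score every spatial location of a convolutional feature map, normalize the scores into an attention map, and use it to weight-average the local descriptors into one segment-level representation. Module and parameter names are illustrative assumptions, not the implementation described in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionPool(nn.Module):
    """Pool a convolutional feature map with learned spatial attention (AttCell-style sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution producing one attention logit per spatial location
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, channels, H, W) local descriptors from one conv layer
        b, c, h, w = fmap.shape
        logits = self.score(fmap).view(b, 1, h * w)   # (B, 1, H*W)
        attn = F.softmax(logits, dim=-1)              # spatial attention over locations
        desc = fmap.view(b, c, h * w)                 # (B, C, H*W)
        # attention-weighted sum of local descriptors -> (B, C) segment representation
        return torch.bmm(desc, attn.transpose(1, 2)).squeeze(-1)

# Example: pool a hypothetical RGB-stream feature map of one video segment
pooled = SpatialAttentionPool(channels=512)(torch.randn(2, 512, 7, 7))
print(pooled.shape)  # torch.Size([2, 512])
```

In a multi-modality setting such as SAN, one such pooled vector per modality would then be concatenated before the temporal-attention fusion stage.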
(2) This thesis presents a novel deep architecture called Recurrent Tubelet Proposal and Recognition (RTPR) networks to incorporate temporal context across frames for action detection. The proposed RTPR consists of two correlated networks, i.e., Recurrent Tubelet Proposal (RTP) networks and Recurrent Tubelet Recognition (RTR) networks. The RTP initializes action proposals of the start frame through a Region Proposal Network on the feature map and then estimates the movements of the proposals in the next frame in a recurrent manner. The action proposals of different frames are linked to form tubelet proposals. The RTR capitalizes on a multi-channel architecture, where in each channel a tubelet proposal is fed into a Convolutional Neural Network (CNN) plus Long Short-Term Memory (LSTM) network to recurrently recognize the action in the tubelet. We conduct extensive experiments on four benchmark datasets and demonstrate superior results of RTPR over state-of-the-art methods. More remarkably, we obtain mAP of 98.6%, 81.3%, 77.9% and 22.3%, with gains of 2.9%, 4.3%, 0.7% and 3.9% over the best competitors, on UCF-Sports, J-HMDB, UCF-101 and AVA, respectively.

(3) This thesis presents the Long Short-Term Relation Networks (LSTR) architecture, which models both short-term and long-term relations to boost video action detection. Particularly, we study the problem from the viewpoint of employing human-context relations within each video clip and leveraging supportive context from long-range temporal dynamics. To verify our claim, we utilize Tubelet Proposal Networks to generate 3D actor tubelets in all video clips. For each actor tubelet, LSTR dynamically predicts the spatio-temporal attention map on the fly via adaptive convolution to indicate the essential context, and measures the human-context relation on the attention map. Such short-term relation is encoded into a context feature to augment the tubelet feature. Moreover, LSTR builds a graph on all the actor tubelets and capitalizes on Graph Convolutions to propagate the long-term temporal relation over the graph and further enrich the tubelet feature. Extensive experiments conducted on four benchmark datasets validate our proposal and analysis. More remarkably, we achieve new state-of-the-art performances on the AVA dataset.

(4) This thesis introduces a new design of sub-graphs to represent and encode the discriminative patterns of each action in videos. Specifically, we present the MUlti-scale Sub-graph LEarning (MUSLE) framework, which novelly builds space-time graphs and clusters the graphs into compact sub-graphs on each scale with respect to the number of nodes. Technically, MUSLE produces 3D bounding boxes, i.e., tubelets, in each video clip as graph nodes and takes dense connectivity as the graph edges between tubelets. For each action category, we execute online clustering to decompose the graph into sub-graphs on each scale by learning a Gaussian Mixture Layer, and select the discriminative sub-graphs as action prototypes for recognition. Extensive experiments are conducted on both Something-Something V1&V2 and Kinetics-400 datasets, and superior results are reported when comparing to state-of-the-art methods. More remarkably, our MUSLE achieves the best accuracy reported to date, 65.0%, on the Something-Something V2 validation set.
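Both LSTR in (3) and MUSLE in (4) treat actor tubelets as nodes of a densely connected space-time graph and propagate relations over it. The following is a minimal, hedged sketch of that shared idea: pairwise affinities between tubelet features define the graph edges, and one graph-convolution step enriches each tubelet with long-range context. The layer shapes and names are assumptions for illustration, not the thesis code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TubeletRelationGCN(nn.Module):
    """One graph-convolution step over actor tubelet features (relation-modeling sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tubelets: torch.Tensor) -> torch.Tensor:
        # tubelets: (N, dim), one feature vector per actor tubelet in the video
        affinity = tubelets @ tubelets.t()        # (N, N) pairwise feature similarity
        adjacency = F.softmax(affinity, dim=-1)   # row-normalized dense graph edges
        context = adjacency @ self.proj(tubelets) # propagate features over the graph
        return tubelets + F.relu(context)         # residual enrichment of each tubelet

# Example: enrich 8 tubelet features of dimension 256 with long-range relation
enriched = TubeletRelationGCN(dim=256)(torch.randn(8, 256))
print(enriched.shape)  # torch.Size([8, 256])
```

MUSLE would additionally cluster such graph nodes into compact sub-graphs per scale (e.g., via a Gaussian Mixture Layer) and keep the discriminative sub-graphs as action prototypes.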
Keywords/Search Tags: Convolutional Neural Networks, Video Action Recognition, Video Action Detection, Spatio-Temporal Relationship