Spatiotemporal Modeling For Video Human Action Recognition

Posted on: 2022-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Cao
Full Text: PDF
GTID: 2518306605971839
Subject: Circuits and Systems
Abstract/Summary:
With the development of multimedia such as audio, text, image, and video, the way information is expressed is gradually moving closer to human perception. Video representation is close to human perception, which makes understanding video content meaningful work. This thesis focuses on human action recognition in video, a task with broad application prospects in intelligent video surveillance, human-computer interaction, intelligent security, and other fields. Deep learning methods for action recognition have developed rapidly in recent years, and this thesis also studies deep learning methods for the task. At present, the commonly used data modalities are RGB image data and skeleton data. This thesis discusses video data modeling under these different modalities. We decompose 3D video data into a spatial dimension and a temporal dimension and, building on existing modeling methods, design efficient methods for each dimension separately.

First, we introduce spatial modeling based on skeleton data, exploring a spatial modeling method for the two-person interaction recognition subtask. Skeleton data is well suited to behavior recognition because it is robust to interference and focuses on expressing the human body. However, because skeleton data is small in scale, extracting useful features from it requires more accurate feature localization. Inspired by the human cognitive system, this thesis introduces prior knowledge into two-person interaction recognition with skeleton data. The constructed spatial knowledge graph directly connects the relevant spatial information and avoids the interference introduced when hierarchical convolution expands the receptive field. We build a knowledge-embedded graph convolution network in which the knowledge graph represents the abstract spatial relation structure and graph convolution extracts the specific spatial features. We then design two graphs that exploit this knowledge: a prior knowledge connection and a learning knowledge connection. Experiments show that adding prior knowledge helps the network attend to the important spatial features and effectively improves two-person interaction recognition.

Second, spatial modeling of RGB data is a major research direction because the spatial information in RGB images is complex and variable. This thesis uses the complementarity of skeleton data and RGB images to design an attention guidance mechanism in which skeleton data guides the feature learning process for RGB images. Following the principle that high-level semantics guide low-level semantics, the high-level semantics of skeleton features are used to help the lower-level semantics of RGB features. As a result, the network can purposefully learn the spatial features useful for action recognition and ignore irrelevant background and confusing object features. Two exploratory methods are proposed for implementing the guidance mechanism, and experiments on a public dataset verify their effectiveness. The experiments confirm that the skeleton attention guidance mechanism effectively promotes the learning of RGB spatial features and makes the network focus on feature positions conducive to action recognition. In other words, the proposed method can effectively exploit the complementarity of multimodal data.

Finally, because skeleton data represents temporal information more comprehensively, temporal modeling is explored on skeleton data. An analysis of existing dynamic temporal feature extraction methods shows that local convolution is clearly limited when extracting features along the time dimension. We therefore explore a feature extraction method suited to temporal information. Because elements along the time dimension are ordered and can combine in diverse ways, non-local convolution is necessary, and spatial graph convolution is used to extract semantics that skip across time steps. Specifically, the global temporal relation structure is established by constructing a time relevance graph. Experiments and analysis show that non-local convolution over time helps the network selectively extract the core temporal dynamics and achieves better recognition performance.
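The time relevance graph idea can be illustrated with a small sketch. The abstract does not say how the graph is built, so everything here is an assumption made for illustration: relevance is taken to be scaled dot-product similarity between per-frame feature vectors, normalized per row with softmax, and the function names (`temporal_relevance_graph`, `temporal_graph_conv`) and toy values are hypothetical. It is written in dependency-free Python for readability, not as the thesis implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_relevance_graph(frames):
    """Build a T x T relevance graph over frames.

    Assumption (not from the thesis): relevance is the scaled dot product
    between per-frame feature vectors, row-normalized with softmax.
    """
    dim = len(frames[0])
    transposed = [list(col) for col in zip(*frames)]
    scores = matmul(frames, transposed)
    return [softmax([s / math.sqrt(dim) for s in row]) for row in scores]


def temporal_graph_conv(frames, weight):
    """One graph-convolution step over time: A @ X @ W."""
    adjacency = temporal_relevance_graph(frames)
    return matmul(matmul(adjacency, frames), weight)


# Toy input (hypothetical values): 4 frames with 2 features each.
# Frames 0 and 2 are similar, although they are not adjacent in time.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]

graph = temporal_relevance_graph(frames)
output = temporal_graph_conv(frames, identity)
```

In this toy input, frame 0 is linked more strongly to the non-adjacent frame 2 than to its neighbor, frame 1. A local temporal convolution with a small kernel cannot express that kind of connection in a single layer, which is the limitation the non-local approach is meant to address.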
Keywords/Search Tags: action recognition, graph convolution, multimodality, knowledge guidance, relevance