
Deep Learning Based Human Action Recognition With RGB Videos

Posted on: 2022-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: S C Liu
Full Text: PDF
GTID: 2518306314474404
Subject: Control Science and Engineering
Abstract/Summary:
Human action recognition is a representative interdisciplinary task and has long been a popular research direction in computer vision, artificial intelligence, and related fields. It aims to analyze and recognize human actions from image and video data, and its research achievements have practical applications in virtual reality, security monitoring, human-computer interaction, multimedia content understanding, and more. In recent years, although deep learning has achieved great success in action recognition, challenges and difficulties remain. First, actions change slowly over short time intervals, so consecutive frames are highly redundant; how to extract key frames from videos and distinguish action-relevant motion regions from cluttered backgrounds urgently needs to be solved. Second, an action is represented by information in two dimensions, spatial and temporal, so it is important to develop deep learning based algorithms that efficiently fuse the spatiotemporal information of actions. Besides, the design of an end-to-end action recognition framework also presents many difficulties. To address these issues, this thesis carries out a series of studies on deep learning based human action recognition with RGB videos. The main work is as follows:

(1) A spatial attention module and a temporal attention module are proposed, and based on these two modules an end-to-end framework for action recognition is designed. The temporal attention module combines global average pooling and global max pooling to discover key frames in videos. The spatial attention module fuses the value feature and the gradient feature of the feature map, making the convolutional neural network's representation for action recognition focus on the informative motion regions of actions. By injecting the spatial and temporal attention modules into existing convolutional neural networks, we obtain a novel end-to-end action recognition framework. Extensive experiments and comparisons with other methods demonstrate the effectiveness of our approach.

(2) A cross-modality attention based appearance-motion fusion network (AMFNet) is proposed, which can learn a more efficient and robust action representation from RGB and optical flow data in an end-to-end manner. AMFNet is constructed by connecting a convolutional neural network with an appearance-motion fusion block (AMFB), whose goal is to incorporate the appearance and motion information of RGB and optical flow data into a unified framework driven by a cross-modality attention (CMA) mechanism. CMA relies only on optical flow data and consists of a key-frame adaptive selection module and an optical-flow-driven spatial attention module. The former adaptively identifies the most discriminative key frames in a sequence, while the latter guides the network to focus on the important action-relevant regions of each frame. In addition, we explore two schemes for appearance and motion fusion in AMFB. Extensive experiments and comparisons with other methods demonstrate the effectiveness of our approach.
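The two attention ideas above can be sketched in PyTorch. This is a minimal illustration, not the thesis's exact architecture: the layer sizes, the reduction ratio, and the 1x1-convolution gating are assumptions, and the spatial attention here shows only the optical-flow-driven gating of contribution (2), not the value/gradient fusion of contribution (1).

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Per-frame attention built from global average and global max pooling,
    in the spirit of the temporal attention module described in the abstract.
    The two-layer scoring network and reduction ratio are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 1),
        )

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        avg = x.mean(dim=(3, 4))   # global average pooling over space
        mx = x.amax(dim=(3, 4))    # global max pooling over space
        scores = self.fc(torch.cat([avg, mx], dim=-1)).squeeze(-1)  # (b, t)
        weights = torch.softmax(scores, dim=1)  # emphasise key frames
        return x * weights[:, :, None, None, None], weights


class FlowDrivenSpatialAttention(nn.Module):
    """Optical-flow-driven spatial attention loosely following the CMA idea:
    a gate computed from flow features reweights the RGB feature map."""

    def __init__(self, flow_channels):
        super().__init__()
        self.gate = nn.Conv2d(flow_channels, 1, kernel_size=1)

    def forward(self, rgb_feat, flow_feat):
        # rgb_feat, flow_feat: (batch, C, H, W) with matching spatial size
        attn = torch.sigmoid(self.gate(flow_feat))  # (batch, 1, H, W)
        return rgb_feat * attn  # highlight action-relevant motion regions
```

In this sketch the temporal weights sum to one over the frame axis, so redundant frames are down-weighted rather than discarded, and the flow-derived gate is applied multiplicatively so the RGB backbone stays unchanged.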
Keywords/Search Tags: Deep learning, Human action recognition, RGB videos, Attention mechanism, Spatiotemporal information fusion