
Video Action Recognition Based On Multi-Stream Network Architecture

Posted on: 2024-07-01    Degree: Master    Type: Thesis
Country: China    Candidate: J Wang    Full Text: PDF
GTID: 2568307139996399    Subject: Engineering
Abstract/Summary:
Video action recognition is a crucial research direction in computer vision, with widespread applications in fields such as video surveillance, intelligent medical care, intelligent transportation, and human-computer interaction. The crux of the task is to extract motion features from the actions depicted in a video, yet videos often contain background clutter and irrelevant background objects that interfere with recognizing the target behavior. In addition, complex conditions such as target occlusion, lighting changes, and camera motion degrade the target's feature information and thus impair recognition accuracy. To address these problems, this paper proposes two novel deep-learning-based video action recognition approaches, presented as follows.

First, this paper proposes spatio-temporal target saliency-based multi-stream multiplier ResNets (STOMM-ResNets) for action recognition. The STOMM-ResNets model consists of three interacting streams: an appearance stream, a motion stream, and a spatio-temporal target saliency stream. As in the traditional two-stream CNN model, the appearance stream and the motion stream capture appearance information and motion information, respectively, while the spatio-temporal target saliency stream captures spatio-temporal target saliency information. Furthermore, to effectively exploit the spatio-temporal interaction information between streams, the model establishes interactive connections among the three streams, replacing the information fusion usually performed only at the final output layer. Two multiplicative connections are injected: the first from the motion stream to the appearance stream, and the second from the spatio-temporal target saliency stream to the appearance stream. STOMM-ResNets is evaluated on two standard video action recognition datasets, UCF101 and HMDB51, and the experimental results validate the effectiveness of the model.

Second, this paper proposes spatio-temporal target saliency-based multi-stream ResNets-LSTM (STOM-LSTM), which combines three streams (spatial, temporal, and spatio-temporal saliency) for video action recognition, capturing the foreground information of spatio-temporal objects in videos while suppressing background information. In addition, to capture long-term temporal dependencies between consecutive video frames, an attention-aware LSTM is applied on top of the spatio-temporal target saliency-based multi-stream ResNets. STOM-LSTM is evaluated on UCF101 and HMDB51 and compared with STOMM-ResNets and other models; on the same datasets it achieves comparable accuracy and better performance than STOMM-ResNets. The results show that the proposed STOM-LSTM model performs well.
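To make the two architectures concrete, the following minimal PyTorch sketches illustrate the mechanisms described above. They are illustrative assumptions based only on this abstract: the backbone choice (ResNet-18 trunks), sigmoid gating, fusion point, and feature dimensions are not specified in the thesis, so this is a sketch rather than the author's implementation. The first sketch shows the two multiplicative injections (motion to appearance, saliency to appearance) applied at the feature-map level instead of late score fusion.

# Sketch of the multiplicative cross-stream connections in STOMM-ResNets
# (assumed design details, not the thesis code).
import torch
import torch.nn as nn
import torchvision.models as models

def resnet_trunk():
    """ResNet-18 up to the last convolutional block (classifier head removed)."""
    backbone = models.resnet18(weights=None)
    return nn.Sequential(*list(backbone.children())[:-2])

class ThreeStreamMultiplier(nn.Module):
    """Appearance, motion and saliency streams with multiplicative injections:
    motion -> appearance and saliency -> appearance, fused before the classifier."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.appearance = resnet_trunk()   # RGB frames
        self.motion = resnet_trunk()       # optical flow (3-channel here for simplicity)
        self.saliency = resnet_trunk()     # spatio-temporal target saliency maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, rgb, flow, sal):
        a = self.appearance(rgb)           # (B, 512, H, W)
        m = self.motion(flow)
        s = self.saliency(sal)
        # Multiplicative connections: modulate appearance features by the
        # other two streams instead of fusing scores at the output layer.
        a = a * torch.sigmoid(m)           # injection 1: motion -> appearance
        a = a * torch.sigmoid(s)           # injection 2: saliency -> appearance
        feat = self.pool(a).flatten(1)
        return self.classifier(feat)

# Usage: a mini-batch of 8 frames with matching flow and saliency inputs.
model = ThreeStreamMultiplier(num_classes=101)       # e.g. UCF101
rgb = torch.randn(8, 3, 224, 224)
flow = torch.randn(8, 3, 224, 224)
sal = torch.randn(8, 3, 224, 224)
print(model(rgb, flow, sal).shape)                    # torch.Size([8, 101])

The second sketch shows one plausible form of the attention-aware LSTM head used in STOM-LSTM: an LSTM models temporal dependencies across per-frame multi-stream features, and a learned per-timestep attention score re-weights the hidden states before classification. The feature and hidden sizes are placeholders.

# Sketch of an attention-weighted LSTM readout over per-frame features
# (an assumed formulation of the "attention-aware LSTM" in the abstract).
import torch
import torch.nn as nn

class AttentionLSTMHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)             # per-timestep attention score
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):                   # (B, T, feat_dim)
        h, _ = self.lstm(frame_feats)                 # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)        # (B, T, 1) attention weights
        pooled = (w * h).sum(dim=1)                   # attention-weighted temporal pooling
        return self.classifier(pooled)

# Usage: 8 clips, 16 frames each, 512-d fused per-frame features.
head = AttentionLSTMHead()
print(head(torch.randn(8, 16, 512)).shape)            # torch.Size([8, 101])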
Keywords/Search Tags: video action recognition, multiple streams, spatio-temporal target saliency, spatio-temporal interaction information, attention-aware long short-term memory network