
Research Of Video Action Recognition Based On Spatial-Temporal Feature

Posted on: 2024-09-19 | Degree: Master | Type: Thesis
Country: China | Candidate: Y H Zhang | Full Text: PDF
GTID: 2568307079954809 | Subject: Information and Communication Engineering
Abstract/Summary:
With the development of network communication technology, video applications have flooded the market, and large amounts of video data are uploaded at all times, requiring timely review and supervision. Action recognition for videos, an important research field of artificial intelligence, has advanced rapidly on the basis of deep learning algorithms. Videos contain both spatial and temporal features. Most existing approaches rely on optical flow or three-dimensional convolution to extract spatio-temporal features, which requires complex model structures and incurs high computational cost. Among lighter-weight alternatives, difference modeling stands out: it improves the speed of the model while preserving recognition accuracy. This thesis designs a video action recognition method based on spatio-temporal features and proposes a new difference extraction method that extracts local spatio-temporal features from each segment of the video; in the global stage, action rhythm features are extracted from the differences of local features. Considering user privacy and data security, the proposed model is integrated with federated learning, and a new personalized federated learning method is proposed according to the characteristics of the model, improving training effectiveness in the federated setting. The main contributions of this thesis are as follows:

(1) For the local spatio-temporal features of video segments, a new method for extracting difference information is proposed: differential features are extracted centered on a single RGB frame, and a spatio-temporal module based on local information is designed. Within each video segment, the spatial feature is extracted from a single randomly sampled RGB frame, which serves as the main feature source of the segment. The difference extraction method therefore takes the two frames immediately before and after the sampled frame and performs difference operations with it; after smoothing the features along the channel dimension, supplementary temporal features grounded in the sampled frame are obtained and fused with the spatial features to form the spatio-temporal features, improving local feature extraction.

(2) For the global temporal features of the whole video, this thesis proposes to extract action rhythm features using difference operations and designs a module based on global information. In the global module, different strides are used to obtain bidirectional difference features that represent different action rhythms; these are fused with the original local features through an attention mechanism to obtain the final global spatio-temporal features. The local and global modules are deployed within the ResNet structure. The proposed model achieves an accuracy of 97.6% on the UCF-101 dataset.

(3) For video recognition under federated learning, a personalized federated learning method tailored to the proposed model is introduced. Because the local features of a video are strongly correlated with the data samples, they are suitable as a user's private features, while the global features focus more on dynamic temporal information and can be extracted using public parameters. Based on this characteristic of the model, its layers are divided into a personal stage and a public stage, and a new personalized federated learning training strategy is proposed to improve model training. On a non-IID (non-independent and identically distributed) partition of the UCF-101 dataset, the model achieves an average accuracy of 97.792%.
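The local difference extraction in contribution (1) can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the function name `local_difference_features` and the use of a simple channel-wise mean as the smoothing step are assumptions; the thesis's module would operate on learned convolutional feature maps rather than raw frames.

```python
# Hypothetical sketch of difference extraction centered on a sampled frame.
# The function name and the channel-mean smoothing are illustrative
# assumptions, not the thesis's exact module.
import numpy as np

def local_difference_features(prev_frame, sampled_frame, next_frame):
    """Fuse a sampled RGB frame with differences to its two neighbours.

    Each frame is an (H, W, C) float array. The two frame differences
    supply motion cues; averaging over the channel dimension stands in
    for the smoothing step before fusion with the spatial feature.
    """
    # Difference operations with the frames before and after the sample.
    diff_before = sampled_frame - prev_frame
    diff_after = next_frame - sampled_frame
    # Smooth each difference map along the channel dimension.
    temporal = (diff_before.mean(axis=-1, keepdims=True)
                + diff_after.mean(axis=-1, keepdims=True)) / 2.0
    # Fuse the supplementary temporal feature with the spatial feature
    # (broadcast over the channel dimension).
    return sampled_frame + temporal
```

The key point of the design is that both differences are taken relative to the sampled frame, so the temporal supplement stays aligned with the frame that supplies the spatial feature.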
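The personal/public split in contribution (3) amounts to a modified federated averaging round in which only the public-stage parameters are aggregated on the server while the personal-stage parameters never leave the client. A minimal sketch, assuming a toy dict-based model (the names `federated_round` and `public_keys` are illustrative, not from the thesis):

```python
# Hypothetical sketch of the personalized federated training strategy:
# clients keep their "personal stage" (local spatio-temporal layers)
# private and only average the "public stage" (global layers).
# The dict-based model and all names here are illustrative assumptions.

def federated_round(clients, public_keys):
    """Average only the public-stage parameters across client models.

    `clients` is a list of dicts mapping layer name -> parameter value;
    `public_keys` names the layers belonging to the shared public stage.
    """
    n = len(clients)
    # Server step: average the public parameters over all clients.
    averaged = {k: sum(c[k] for c in clients) / n for k in public_keys}
    # Client step: install the averaged public layers; personal layers
    # are untouched and remain client-specific.
    for c in clients:
        c.update(averaged)
    return clients
```

Under non-IID data this split lets each client's local-feature layers specialize to its own sample distribution while the temporally oriented global layers still benefit from cross-client aggregation.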
Keywords/Search Tags: Deep Learning, Difference Information, Video Action Recognition, Spatial-temporal Feature, Federated Learning