Research And Implementation Of Fast Frequency-domain Compressed Video Action Recognition

Posted on: 2024-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Xiong
GTID: 2568306944457824
Subject: Electronic Science and Technology
Abstract/Summary:
Human action recognition is an essential task in computer vision: the computer determines the action category of a video by processing and analyzing the video data. It plays an important role in intelligent monitoring, human-computer interaction, video retrieval, and other fields. In recent years, action recognition has received extensive research attention, but it still faces critical challenges, chiefly efficiency and the extraction of salient spatio-temporal clues. Most existing action recognition methods need to decode the video completely into RGB frames, which incurs high spatio-temporal redundancy and high computational complexity. Recently, compressed-domain action recognition methods have also attracted considerable attention. They use a small number of fully decoded frames together with motion vectors and residual data, which reduces the video redundancy to an extent but still requires a long video decompression time. In addition, to capitalize on the rich patterns in the frequency domain, some methods use frequency data directly for action recognition and reach a higher recognition efficiency. However, they do not make full use of the rich semantics in the frequency domain and ignore the complementarity of the frequency and spatial domains, which degrades recognition accuracy. To alleviate these problems, this paper focuses on efficient action recognition in the compressed domain based on frequency-domain data. The main work is as follows:

(1) A faster frequency-domain compressed video action recognition framework is proposed. To address the redundancy in video frames and the high computational complexity, this paper proposes a faster frequency-domain compressed video action recognition framework (Faster-FCoViAR). It obtains the frequency-domain data of compressed videos directly through a novel frequency-domain partial decompression method, which reduces the video pre-processing time. It then down-samples the frequency-domain channels, keeping those that contain the pivotal spatio-temporal clues, to enhance the saliency of the input and reduce data redundancy. Finally, it applies knowledge distillation through a spatial-to-frequency-domain student-teacher network to learn the complementary semantics of the spatial and frequency domains. Experiments on the UCF-101, HMDB-51, and Kinetics-400 datasets show that the proposed method achieves higher recognition speed than RGB-based methods and other compressed-domain methods, with competitive recognition accuracy.

(2) A frequency enhancement network for efficient compressed video action recognition is proposed. Frequency-domain learning methods suffer from the loss of low-frequency texture and edge information, so the object and scene features related to actions cannot be extracted effectively. To solve this problem, this paper proposes a frequency enhancement (FE) block, which includes a temporal-channel two-heads attention (TCTHA) module and a frequency overlapping group convolution (FOGC) module. First, the TCTHA module uses an attention mechanism to emphasize the inter-frame temporal context and the informative intra-frame frequency semantics. Then, the FOGC module groups channels in different frequency bands with overlap, to emphasize the low-frequency spatio-temporal texture and edge clues while maintaining interaction between groups. Finally, the FE block is integrated into 2D-CNNs with frequency I-frame input to obtain the frequency enhancement network (FENet) for efficient compressed video action recognition. Experiments on UCF-101, Kinetics-400, and Kinetics-700 verify that FENet can extract the pivotal low-frequency spatio-temporal texture and edge clues for action recognition, achieving accuracy comparable to RGB-based methods with high efficiency.

(3) A frequency-spatial domain CNN-Transformer two-stream action recognition network for compressed videos is proposed. RGB-based Transformer methods lack local details, which degrades accuracy on subtle and local actions. To address this problem, this paper proposes a frequency-spatial domain CNN-Transformer two-stream action recognition network (FSConformer) for compressed video action recognition. FSConformer takes both frequency-domain and compressed-domain I-frames as input. It uses a frequency-domain spatio-temporal decoder (FDecoder) and a frequency-spatial domain attentive token fusion (FSATFusion) to integrate the local semantics in the frequency domain captured by the CNN with the global semantics in the spatial domain captured by the Transformer, enabling the network to learn complementary information from the two domains. The local details in the frequency domain improve the capability of capturing tiny and local action clues. Experiments on the UCF-101, Kinetics-400, and Kinetics-700 datasets show that the proposed method reaches higher recognition accuracy than other compressed-domain methods and achieves competitive accuracy compared with other RGB-based Transformer methods, with higher inference speed.

(4) A compressed-domain action recognition system is designed and implemented. Based on the frequency enhancement network (FENet) proposed in this paper, a compressed-domain action recognition system is implemented using FFmpeg, C, Python, and other technologies, with online and offline modes for action recognition. In the online mode, the frequency-domain data is obtained from the camera in real time, while in the offline mode it is obtained by the frequency-domain partial decoding method. The frequency-domain data is then fed into FENet to obtain the current action category and other related information.

In summary, this paper carries out research based on the frequency-domain data of compressed video. First, to address information redundancy and low recognition efficiency in video action recognition, a faster frequency-domain compressed video action recognition framework is proposed, which uses frequency-domain data directly for action recognition. Then, by further analyzing the distribution of frequency-domain data, a frequency enhancement network for efficient compressed video action recognition is proposed, to preserve the discriminative low-frequency spatio-temporal texture and edge clues related to video actions. Finally, to address the lack of local details in RGB-based Transformer methods, which degrades accuracy on subtle and local actions, a frequency-spatial domain CNN-Transformer two-stream action recognition network is proposed; it captures the complementary semantics of the frequency and spatial domains simultaneously and improves the ability of the network to recognize tiny and local actions. Furthermore, a compressed-domain action recognition system is implemented, which demonstrates the practicability and effectiveness of the proposed methods.
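
To make the idea of frequency-domain channels in contribution (1) more concrete, the following is a minimal sketch and not the thesis implementation. It assumes 8x8 block DCT coefficients, as used by JPEG/MPEG-style intra coding, and it assumes that channel down-sampling means keeping the lowest-frequency coefficients in zigzag order; the abstract does not state the actual selection rule. The DCT is computed here from pixel values purely for illustration, whereas the thesis reads frequency data from the compressed stream by partial decompression. All function names (`dct_1d`, `dct_2d`, `zigzag_order`, `frequency_channels`) are hypothetical.

```python
import math

BLOCK = 8  # assumed block size (8x8, as in JPEG/MPEG-style intra coding)


def dct_1d(values):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """2-D DCT-II of a square block: transform rows, then columns."""
    n = len(block)
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[i][j] for i in range(n)]) for j in range(n)]
    # cols[j][u] holds vertical frequency u at horizontal frequency j.
    return [[cols[j][u] for j in range(n)] for u in range(n)]


def zigzag_order(n=BLOCK):
    """(row, col) coefficient positions ordered from low to high frequency."""
    cells = [(u, v) for u in range(n) for v in range(n)]
    return sorted(
        cells,
        key=lambda c: (c[0] + c[1], c[0] if (c[0] + c[1]) % 2 else -c[0]),
    )


def frequency_channels(frame, keep):
    """Turn an HxW frame into `keep` channels of size (H/8)x(W/8).

    Channel c collects, for every 8x8 block, the c-th coefficient in
    zigzag order (an assumed selection rule, see the lead-in).
    """
    h, w = len(frame), len(frame[0])
    assert h % BLOCK == 0 and w % BLOCK == 0, "frame must tile into 8x8 blocks"
    positions = zigzag_order()[:keep]
    channels = [
        [[0.0] * (w // BLOCK) for _ in range(h // BLOCK)] for _ in range(keep)
    ]
    for by in range(h // BLOCK):
        for bx in range(w // BLOCK):
            block = [
                row[bx * BLOCK:(bx + 1) * BLOCK]
                for row in frame[by * BLOCK:(by + 1) * BLOCK]
            ]
            coeffs = dct_2d(block)
            for c, (u, v) in enumerate(positions):
                channels[c][by][bx] = coeffs[u][v]
    return channels


if __name__ == "__main__":
    flat = [[10.0] * 16 for _ in range(16)]  # a constant 16x16 frame
    chans = frequency_channels(flat, keep=6)
    print(len(chans), len(chans[0]), len(chans[0][0]))
    print(round(chans[0][0][0], 6))
```

For a constant frame, only the first (DC) channel is non-zero, up to floating-point rounding, which illustrates why most of the signal energy of smooth regions sits in a few low-frequency channels and why dropping the remaining channels can reduce input redundancy.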
Keywords/Search Tags: action recognition, video classification, compressed video action recognition, frequency domain