
Compressed Video Classification Based On Two-stream Convolutional Neural Network

Posted on: 2022-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wu
GTID: 2518306509454684
Subject: Software engineering

Abstract/Summary:
Traditional video classification methods are based on hand-crafted features and achieved relatively good performance in early tasks. However, they rely heavily on hand-designed feature extraction algorithms and task-specific knowledge, so the field has gradually shifted to deep learning-based approaches. A classical deep learning solution is the two-stream convolutional neural network, which performs well in action recognition: the network is divided into a spatial stream and a temporal stream, which take RGB video frames and dense optical flow as inputs, respectively. This method has two drawbacks. First, using dense optical flow as the input of the temporal stream is computationally expensive: current extraction algorithms are extremely time-consuming and cannot meet the requirements of real-time tasks. Second, the two networks are trained independently and fused only at the final prediction-score stage, without considering the connection between the temporal and spatial streams. To address these problems, this thesis makes the following contributions:

(1) Optical flow is avoided in the temporal stream; instead, the Motion Vectors (MVs) extracted from the compressed domain are used as temporal features, which greatly reduces extraction time. Some existing algorithms also use MVs as features, but they use only the raw MVs, which results in lower accuracy. This thesis proposes a motion enhancement strategy based on accumulating the motion field of the compressed video, which strengthens the motion information and temporal continuity of the accumulated MVs and thus improves performance.

(2) A fusion strategy based on different temporal resolutions is proposed: the spatial and temporal streams use different temporal resolutions so that each stream learns its own features more specifically. In addition, temporal-stream features from different stages are fused into the spatial-stream features inside the network to obtain more effective representations.

(3) A motion enhancement model based on high-level spatial features is proposed. Its main idea is knowledge distillation: spatial information is transferred into the temporal stream so that the temporal stream obtains more information.

Experimental results show that the strategies proposed in this thesis greatly improve the accuracy obtained with MVs, and the final recognition accuracy is maintained without using optical flow.
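To make the accumulation idea in contribution (1) concrete, the following minimal NumPy sketch accumulates per-frame motion vector fields back to a reference frame. The function name, sign convention, and nearest-neighbor lookup are illustrative assumptions; the abstract does not specify the thesis's exact accumulation rule.

import numpy as np

def accumulate_motion_vectors(mv_fields):
    # mv_fields: list of (H, W, 2) float arrays; mv_fields[t][y, x] is the
    # (dx, dy) displacement from frame t back to its reference frame t-1.
    # The reference I-frame's field (index 0) is assumed to be all zeros.
    h, w, _ = mv_fields[0].shape
    ys, xs = np.mgrid[0:h, 0:w]
    accumulated = [mv_fields[0]]
    for mv in mv_fields[1:]:
        prev = accumulated[-1]
        # Trace each pixel to its reference location in the previous frame
        # (nearest-neighbor), then add the motion already accumulated there,
        # so every field points all the way back to the I-frame.
        ref_x = np.clip(xs + np.rint(mv[..., 0]).astype(int), 0, w - 1)
        ref_y = np.clip(ys + np.rint(mv[..., 1]).astype(int), 0, h - 1)
        accumulated.append(mv + prev[ref_y, ref_x])
    return accumulated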
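Contribution (3) is described only as being "based on knowledge distillation", so the sketch below shows a generic soft-label distillation loss (temperature-scaled KL divergence plus cross-entropy) in PyTorch, with the spatial (RGB) stream as teacher and the MV-based temporal stream as student. The loss form, temperature, and weighting are assumptions for illustration; the thesis's actual loss and fusion points may differ.

import torch
import torch.nn.functional as F

def distillation_loss(mv_logits, rgb_logits, labels, T=4.0, alpha=0.5):
    # Hard-label cross-entropy on the temporal (student) stream.
    ce = F.cross_entropy(mv_logits, labels)
    # Soft targets from the frozen spatial (teacher) stream at temperature T.
    kd = F.kl_div(
        F.log_softmax(mv_logits / T, dim=1),
        F.softmax(rgb_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd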
Keywords/Search Tags: motion vectors, two streams, knowledge distillation, temporal resolution