
Research On Conjunct Static-Dynamic Efficient Method Of Action Recognition

Posted on: 2022-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Ban
Full Text: PDF
GTID: 2518306773490554
Subject: Automation Technology
Abstract/Summary:
Action recognition is an active research direction in computer vision. With the spread of short-video applications on the internet, related services have reached nearly every corner of the commercial economy, and the vision technology built on them has promising application prospects. As a key technology for automatically tagging, analyzing, and predicting the content of short videos, action recognition has achieved notable results at computer vision conferences in recent years. However, video data has an additional time dimension compared with two-dimensional image data, which places much greater pressure on inference cost and storage on edge devices. Compressing existing action recognition algorithms has therefore become an urgent practical need. From the perspective of reducing computational cost while retaining competitive performance, current efficient action recognition algorithms face three related problems:

(1) Most popular static compression methods process all data in the same way, so computation is wasted on samples that are either too simple or too difficult.

(2) Unlike the traditional teacher-student distillation model, online distillation, proposed in recent years, treats each sub-network as both a teacher and a student, and obtains better results than traditional distillation. Such online distillation is generally based on sub-networks of the same depth. This work examines whether sub-networks with different computational depths distill better online, and thus yield better model compression.

(3) Most action recognition algorithms predict from the weighted mean of inference results over the whole video, obtained either by randomly sampling some video clips and averaging their results, or by running inference on all the video data and averaging the results of each segment. This approach significantly increases the computational burden and inference time of video action recognition. In addition, irrelevant actions in the video can interfere with the final prediction, so this work explores a solution that is more accurate and less computationally expensive.

The main contributions of this work are as follows.

For the waste of resources caused by static methods in problem (1), this work introduces dynamic methods and proposes an efficient action recognition framework that combines dynamic and static methods. Combining the advantages of both yields a more general and adaptable action recognition framework.

For the problem of uniform complexity and convergence of online distillation sub-networks in problem (2), this work extends RNNPool, a recent model compression method from 2D image recognition, to video action recognition. The resulting 3D-RNNPool is designed to process video data efficiently. Using it, this work proposes an efficient multi-scale mutual distillation network based on 3D-RNNPool downsampling. Each sub-model has a distinct computational complexity and shares knowledge with the others through self-attention mutual distillation, which increases the diversity between branches. Experimental comparison shows that, relative to mutual distillation among sub-networks of the same complexity, the stronger diversity in this work improved the posterior entropy and reduced overfitting during distillation. Better distillation performance is achieved while effectively reducing the computational burden.

For the problem of redundant and interfering video clips in problem (3), this work studies the importance of individual clips relative to the whole video by means of a double-gate control network. By analyzing the inference data in advance, the network identifies the decisive, highly important action regions and ignores auxiliary or irrelevant action clips in the video. The experimental results show that the important-clip learning network greatly reduces the average inference cost per video, which demonstrates the effectiveness and feasibility of pruning at the data end. The Gumbel-Softmax mechanism is then introduced to construct a branch selection network that further reduces the computational budget.

Finally, KEDNet, a key-segment efficient dynamic network integrating the three models, achieves competitive performance compared with state-of-the-art models in efficient action recognition. It also significantly lowers the computational cost limit for 3D approaches in this domain.
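
The online mutual distillation idea described above, where every sub-network acts as both teacher and student, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a plain mutual-learning loss (cross-entropy plus the mean KL divergence from each peer's softened prediction), leaves out the self-attention weighting and the 3D-RNNPool branches entirely, and uses random logits in place of real sub-network outputs. The function names and the temperature value are hypothetical choices for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional softening temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch dimension."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

def mutual_distillation_losses(branch_logits, labels, temperature=2.0):
    """Return one loss per branch: its own cross-entropy plus the mean KL
    divergence from every peer's softened prediction (needs >= 2 branches)."""
    hard = [softmax(l) for l in branch_logits]
    soft = [softmax(l, temperature) for l in branch_logits]
    losses = []
    for i in range(len(branch_logits)):
        # Each peer j acts as a teacher for branch i.
        peer_kl = [kl_divergence(soft[j], soft[i])
                   for j in range(len(soft)) if j != i]
        losses.append(cross_entropy(hard[i], labels) + sum(peer_kl) / len(peer_kl))
    return losses

# Stand-in data: 3 branches, a batch of 4 clips, 3 action classes.
rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 2])
branch_logits = [rng.normal(size=(4, 3)) for _ in range(3)]
losses = mutual_distillation_losses(branch_logits, labels)
print(losses)
```

In this sketch, branches that already agree contribute no distillation term, so the extra signal comes only from disagreement between peers. That is one way to read the abstract's claim that more diverse sub-networks give mutual distillation more to work with.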
Keywords/Search Tags:Computer vision, Action Recognition, Model Compression, Knowledge Distillation, Dynamic Inference, Deep Learning