| Spatiotemporal behavior detection is an important research topic in computer vision. Its main task is to identify specific behavior instances in a video clip, determine the start and end time of each action, and localize the action spatially in every frame. To meet the real-time and generalization requirements of practical application scenarios, this thesis adopts a model architecture based on lightweight networks. However, future frames are unavailable during online detection, the feature extraction ability of lightweight networks is limited, and edge devices have low computing power but strict real-time requirements. To meet these practical requirements and improve the generalization of spatiotemporal behavior detection networks, this thesis improves the network through feature enhancement and network acceleration. The main contributions are as follows. (1) Feature enhancement: To address the insufficient feature extraction ability of lightweight networks during online detection, this thesis enables the network to obtain and use richer feature information in four ways. First, to capture features at different scales, a feature pyramid structure is introduced that performs detection in parallel on feature maps at different levels, improving recognition accuracy for targets of different sizes. Second, to make the extracted features more accurate, the better-performing SIoU loss is adopted as the regression loss, improving the accuracy of action localization. Third, to eliminate interference between different kinds of feature information, the single detection head that couples classification and regression features is replaced with a decoupled detection head that handles the two tasks in parallel, improving both classification and localization accuracy. Finally, to fuse 2D spatial features and 3D spatiotemporal features more fully, the attention mechanism of CFAM is improved by introducing multi-head self-attention with three learnable weight matrices Q, K, and V, adding learnable parameters to improve recognition accuracy. (2) Network acceleration: To address the computational complexity and poor real-time performance of the network during online detection, this thesis reduces computation and improves recognition speed in three ways. First, to reduce the computation of the 3D backbone network, a more efficient backbone is selected, which significantly reduces the number of network parameters and thus improves recognition speed. Second, to avoid extracting redundant feature information, an interaction layer module is added between the two backbone networks so that they are connected during the feature extraction phase. Key-frame features with richer spatial information, extracted by the 2D backbone, are integrated into the 3D backbone to avoid repeated extraction of highly overlapping information, improving recognition accuracy and speed. Third, to reduce the time spent in the prediction layer, the anchor-based prediction scheme is replaced with an anchor-free one, greatly reducing the number of candidate boxes. In addition, to address the high computational cost and the neglect of high-quality prediction boxes, single-positive-sample label assignment is replaced with dynamic label assignment, which increases the number of positive samples from a globally optimal perspective and improves both recognition accuracy and speed. Finally, experiments demonstrate the effectiveness of these improvements: on the UCF101-24 dataset, the improved network raises F-mAP by 6.1% and FPS by 12, improving both recognition accuracy and speed. |
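
The multi-head self-attention mentioned in the feature enhancement part can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the token layout (rows of `x` as feature vectors), the head count, and the absence of an output projection are all assumptions made for illustration, and the weights here are random stand-ins for learned parameters. It only shows how three weight matrices produce queries, keys, and values, and how each head computes scaled dot-product attention before the heads are concatenated.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """Hypothetical helper. x: (tokens x d_model); w_q, w_k, w_v: (d_model x d_model)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    # Project the same input into queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

    out = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        q_h = [row[lo:hi] for row in q]
        k_h = [row[lo:hi] for row in k]
        v_h = [row[lo:hi] for row in v]

        # Scaled dot-product scores between every pair of tokens.
        k_t = [list(col) for col in zip(*k_h)]
        scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q_h, k_t)]
        weights = [softmax(row) for row in scores]

        # Each token becomes a weighted average of the value vectors.
        head_out = matmul(weights, v_h)
        for i, row in enumerate(head_out):
            out[i].extend(row)  # concatenate heads along the channel axis
    return out


if __name__ == "__main__":
    random.seed(0)
    tokens, d_model = 4, 4  # illustrative sizes, not taken from the thesis
    x = [[random.random() for _ in range(d_model)] for _ in range(tokens)]
    w_q, w_k, w_v = (
        [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_model)]
        for _ in range(3)
    )
    y = multi_head_self_attention(x, w_q, w_k, w_v, num_heads=2)
    print(len(y), len(y[0]))
```

In a trained network the three matrices would be learned by backpropagation, and a framework tensor library would replace the list-based arithmetic. The sketch keeps the output the same shape as the input so that it could, in principle, stand in for an existing attention step.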