| Video behavior recognition is widely used in fields such as intelligent security and video retrieval, and is an important task in video understanding. Although deep learning has achieved excellent results in video behavior recognition, there is still room for improvement. First, different behaviors in a video have different time spans and are unevenly distributed, which makes it difficult to extract spatial and temporal information efficiently. Second, the temporal structure of video is complex and diverse, and effective long-term temporal modeling remains a difficult problem. To address these problems, this thesis studies video behavior recognition based on spatio-temporal modeling. The main contents are as follows: (1) To address the difficulty of efficiently extracting spatial and temporal information when behaviors have different time spans and uneven distribution, this thesis proposes a spatio-temporal modeling behavior recognition algorithm based on behavior key frame sampling and temporal difference. The algorithm uses ResNet50 as the backbone network and samples the video with a behavior key frame sampling method: it first calculates the probability distribution of the amount of motion information over the video frames, then divides the frames into groups so that each group holds an equal share of the motion information, and finally randomly samples video frames from each group, achieving adaptive sampling of behavior key frames for different videos. The algorithm also includes a temporal difference module, which obtains temporal difference maps by taking the difference between each sampled frame and its two preceding and following frames, extracts spatial features from the sampled frame to model spatial information, extracts temporal features from the temporal difference maps to model temporal information, and finally fuses the spatial and temporal features to model the spatio-temporal information in the video. (2) To address the difficulty of modeling long-term temporal information given the complex temporal structure of video, this thesis extends the algorithm above and proposes a spatio-temporal modeling behavior recognition algorithm incorporating temporal adaptive motion excitation and an attention mechanism. The algorithm introduces a temporal adaptive motion excitation module, which first enhances motion-related channels and suppresses useless background information by computing feature-level temporal differences between video clips, then generates video-dependent temporal adaptive convolution kernels from the long-term temporal information of the video, and finally uses convolution to aggregate that information, achieving long-term temporal modeling. In addition, the algorithm incorporates a coordinate attention mechanism that encodes spatial coordinate information while constructing channel attention, enabling the network to precisely locate and enhance features relevant to behavior recognition and further improving performance. The spatio-temporal modeling algorithm incorporating temporal adaptive motion excitation and the attention mechanism is validated on the temporal-related Something-Something V1 dataset and the scene-related HMDB51 dataset, where it obtains behavior recognition accuracies of 51.2% and 73.5%, respectively. The experimental results show that the proposed algorithm effectively improves the accuracy of behavior recognition. |
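
The behavior key frame sampling steps described in (1) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names `motion_scores` and `sample_key_frames`, the use of the mean absolute frame difference as the "amount of motion information", the choice of one sampled frame per group, and the fallbacks for static videos and empty groups are all assumptions made for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def motion_scores(frames):
    """Per-frame motion amount: mean absolute difference to the previous frame.

    Assumption: each frame is a non-empty flat sequence of pixel intensities,
    which keeps the sketch dependency-free. Frame 0 has no predecessor, so it
    reuses the score of frame 1.
    """
    if len(frames) < 2:
        return [0.0] * len(frames)
    diffs = [
        sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return [diffs[0]] + diffs


def sample_key_frames(scores, num_groups, rng=random):
    """Pick one frame index per group, where groups split the motion mass evenly."""
    n = len(scores)
    if n == 0 or num_groups <= 0:
        raise ValueError("need at least one frame and one group")

    # Step 1: probability distribution of motion over frames.
    total = sum(scores)
    if total > 0:
        probs = [s / total for s in scores]
    else:
        probs = [1.0 / n] * n  # assumption: fall back to uniform for static video

    # cdf_before[i] is the motion mass strictly before frame i.
    cdf_before = [0.0] + list(accumulate(probs))[:-1]

    picks = []
    for k in range(num_groups):
        # Step 2: group k holds the frames whose preceding mass lies in
        # [k / num_groups, (k + 1) / num_groups).
        start = bisect_left(cdf_before, k / num_groups)
        if k == num_groups - 1:
            end = n  # the last group absorbs any trailing frames
        else:
            end = bisect_left(cdf_before, (k + 1) / num_groups)

        # Step 3: random sampling inside the group.
        if start < end:
            picks.append(rng.randrange(start, end))
        else:
            # Assumption: if one high-motion frame spans the whole group,
            # reuse that frame instead of leaving the group empty.
            picks.append(start - 1)
    return picks
```

With uniform motion the groups reduce to equal-length segments. When motion is concentrated in a few frames, the group boundaries crowd around those frames, so more samples land near the action, which is the adaptive behavior the abstract describes.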