| With the rapid growth of video data, video action recognition has become a challenging and attractive problem in computer vision research. It can be applied to intelligent video surveillance, face recognition access control systems, autonomous driving, robot vision guidance, and similar tasks, so research on video action recognition has considerable practical significance and application value. As technology advances, the storage space required for video is also increasing. A ten-second video may consist of hundreds or even thousands of frames, and a video carries more temporal semantic information than a simple sequence of pictures. If an image object detection method is applied directly to each frame of a video, it not only ignores the spatiotemporal information of the video but also slows detection, making it difficult to meet the demands of real-time prediction. How to use the spatiotemporal context provided by video to improve detection accuracy, speed, and other aspects of performance, and how to build models effectively from a small amount of data, have become the focus of video action recognition research. Based on this analysis, this thesis investigates multiscale action recognition and few-shot action recognition algorithms. The main contributions are as follows. (1) An action recognition algorithm based on a multipath structure and a gating mechanism is proposed to address the difficulty action recognition models have in modeling spatiotemporal features effectively. The method is built on a multipath network. In the video feature extraction stage, a multipath ResNet extracts the basic spatiotemporal features. Multilevel spatial and temporal features are then processed in parallel: difference operators filter background noise, and multiscale temporal attention extracts temporal features. A gating module adaptively generates fusion-weight convolution parameters for the dynamic fusion of spatiotemporal features, which effectively improves accuracy. Experimental results show that the method achieves good results on publicly available datasets such as Kinetics-400, Something-Something V1, and Something-Something V2. (2) A few-shot action recognition algorithm based on Transformer and a misalignment strategy is proposed to address the need for massive amounts of data in action recognition. The method uses a ResNet with inflated convolution to capture the correlations between preceding and following frames in short-term temporal sequences, encodes the video in chunks using the Transformer's multi-head attention module, and employs a simple yet effective multilevel matching strategy. It then models the overall temporal action using global temporal alignment and local misalignment strategies, and projects the action videos into a metric space to measure the similarity between query and support videos. Experimental results show that the method classifies video actions effectively on the Kinetics-400, UCF101, and HMDB51 datasets. |
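
As a rough illustration of the metric-space comparison described in contribution (2), the following Python sketch classifies a query video embedding by comparing it with the support videos of each class. It is not the method proposed in the thesis. It assumes that each video has already been reduced to a single fixed-length embedding, that each class is summarized by the mean of its support embeddings, and that similarity is measured by cosine similarity. These choices, the function names, and the toy vectors are all hypothetical, and the temporal alignment and misalignment strategies are not reproduced here.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_query(query: Vector, support: Dict[str, List[Vector]]) -> str:
    """Return the support class whose mean embedding is most similar to the query."""
    # Assumption: one prototype per class, the mean of its support embeddings.
    prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}
    return max(
        prototypes,
        key=lambda label: cosine_similarity(query, prototypes[label]),
    )


# Toy 2-way 2-shot episode with hand-made 3-dimensional embeddings
# (hypothetical values, not taken from any dataset).
support_set = {
    "wave": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "jump": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}
predicted = classify_query([0.8, 0.3, 0.0], support_set)
print(predicted)
```

In a real few-shot pipeline the embeddings would come from the trained video encoder, and the similarity would be computed over aligned temporal segments rather than over a single pooled vector.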