The development of technology and the widespread application of artificial intelligence have led to explosive growth in multimedia information, and video, as an information carrier, plays an increasingly important role. A large amount of video content involves human behavior and activities, which makes human behavior recognition based on video sequences an important research direction. It has broad application prospects in intelligent video monitoring, virtual analysis, perceptual interfaces, motion analysis, and other fields. Traditional manual detection and recognition, however, is labor-intensive, highly repetitive, and inefficient. Using computer algorithms to extract valuable information from videos automatically greatly improves efficiency and frees up human resources. Nevertheless, current human motion recognition technology based on video sequences still faces many challenges, such as how to improve recognition accuracy without extensive dataset preprocessing, how to distinguish small spatial and temporal changes between similar motion patterns, and how to process fine-grained information. This article therefore proposes key techniques for improving the model structure and the frame retrieval strategy. The main work and innovations of this article are as follows:

(1) This article proposes an action recognition model based on adaptive multi-frequency-domain self-attention cross fusion, which uses information from multiple frequency domains and improves recognition efficiency through an adaptive frequency selection algorithm. It consists of two branches, one for extracting motion information and one for extracting spatiotemporal information. In the motion branch, regions of strong motion in the video sequence are extracted along motion paths, and fine-grained motion characteristics are learned. In the spatiotemporal branch, average pooling and bilinear downsampling are applied to consecutive frames of the input video to extract static information. The two branches are then fused, and multi-frequency-domain information is added during fusion. The adaptive frequency-domain self-attention cross-fusion module performs this fusion and improves recognition performance.

(2) This article also proposes two algorithms for modeling motion information in video, with the aim of reducing computational cost and improving fine-grained recognition. The first method reduces the dimensionality of each image to cut the amount of computation, then uses the SSIM (structural similarity) algorithm to extract video frames, dynamically selecting the frames that contain more motion information to send to the network. The second method computes the frame difference between consecutive images in the video and uses the Softmax function and entropy to quantify the information in each difference image. The higher the entropy, the stronger the motion information in the image, and frames are selected dynamically according to their entropy values. Finally, the branch results are combined for the final classification.

Experiments on the HMDB51 and Something-Something-V1 datasets show that the proposed approach achieves good recognition results.
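
To make the second frame selection method concrete, the following is a minimal sketch and not the article's implementation. The abstract does not state how the Softmax function and entropy are combined, so this sketch assumes one plausible arrangement: the Shannon entropy of the histogram of each absolute frame difference serves as the motion score, and Softmax turns the per-frame scores into normalized selection weights. The function names `difference_entropy`, `softmax`, and `select_frames`, the `bins` parameter, and the assumption of 8-bit grayscale frames are all hypothetical.

```python
import numpy as np


def difference_entropy(prev_frame, next_frame, bins=32):
    """Shannon entropy (in nats) of the absolute frame-difference histogram.

    Assumes 8-bit grayscale frames of equal shape (an assumption of this sketch).
    A static pair gives a single occupied bin and therefore zero entropy.
    """
    diff = np.abs(next_frame.astype(np.float64) - prev_frame.astype(np.float64))
    counts, _ = np.histogram(diff, bins=bins, range=(0.0, 255.0))
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # skip empty bins so log() is defined
    return float(-(probs * np.log(probs)).sum())


def softmax(scores):
    """Numerically stable Softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def select_frames(frames, num_frames):
    """Pick the frames whose difference to the previous frame has the highest entropy.

    Returns the selected frame indices in temporal order, plus the Softmax
    weights of all candidate frames (frame 0 has no predecessor and is skipped).
    """
    scores = np.array(
        [difference_entropy(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    )
    weights = softmax(scores)
    top = np.argsort(weights)[::-1][:num_frames]
    return np.sort(top + 1), weights  # +1 maps a difference index to its frame index
```

Calling `select_frames(frames, k)` on a list of grayscale frames returns the `k` frame indices with the highest difference entropy. Under this histogram-based assumption a static frame pair scores zero, which matches the stated rule that higher entropy indicates stronger motion.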