Human action recognition is an important part of video analysis. At present, most action recognition methods rely only on the visual modality, i.e., they recognize actions by modeling the spatio-temporal information in video. Humans, however, perceive the world through multiple senses, such as vision, hearing, smell, and touch, and any single sensory channel gives only a partial and potentially biased picture. Vision and hearing are the main channels through which people analyze and understand video, so integrating visual and auditory information in video analysis is consistent with how humans naturally perceive their environment. How to use visual and auditory information effectively to build a high-performance audio-visual joint recognition method remains a major challenge in video-based human action recognition. Audio-visual action recognition aims to improve recognition performance through the fusion of, and interaction between, the audio and visual modalities in video. This paper focuses on effective methods for audio-visual joint learning of human action recognition.

In the frame-level sequential action recognition task, existing methods use only the visual modality in the action classification stage and lack both auditory information and an audio-visual fusion mechanism. This paper designs an audio-visual joint learning classifier that recognizes human actions from fused audio-visual features. Comparative experiments on the Alibaba Cloud Tianchi dataset show that fusing audio and visual features substantially improves the accuracy of sequential action recognition over a visual-only classifier, providing preliminary evidence of the benefit of audio-visual joint learning.

In the segment-level audio-visual action recognition task, existing methods do not distill the auditory information under the guidance of visual information. Background noise in the audio interferes with action recognition, degrades the spatial localization of sounding objects in the visual stream, and weakens the model's ability to capture important information. To address this problem, this study proposes a dynamic-vision-guided auditory attention module and a past-future dynamic vision extraction module, which distill the auditory information by mining the content in the audio that resonates with the visual dynamics, thereby improving recognition performance. The effectiveness of these contributions is verified on the AVE audio-visual event dataset under both fully supervised and weakly supervised settings.

In the video-level audio-visual action recognition task, existing methods do not screen for the key time periods of an action. Background segments unrelated to human action interfere with recognition and introduce redundant computation. To address this problem, this paper proposes a key frame filtering module that selects, along the temporal dimension, the information in the video that is relevant to the action, reducing the interference of background content on action recognition. Its effectiveness is verified through experiments on the large-scale human action dataset ActivityNet.
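
The abstract does not give implementation details for the audio-visual joint classifier. As an illustration only, the sketch below shows one plausible concatenation-based fusion of per-frame visual and audio features followed by a classifier head; the class name, feature dimensions, and layer sizes are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AudioVisualFusionClassifier(nn.Module):
    """Concatenates per-frame visual and audio features and classifies the
    fused vector with a small MLP head (illustrative sketch only)."""

    def __init__(self, visual_dim=2048, audio_dim=128, hidden_dim=512, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, visual_feat, audio_feat):
        # visual_feat: (batch, visual_dim), audio_feat: (batch, audio_dim)
        fused = torch.cat([visual_feat, audio_feat], dim=-1)
        return self.classifier(fused)


# Example with dummy features from hypothetical visual/audio backbones.
logits = AudioVisualFusionClassifier()(torch.randn(4, 2048), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 10])
```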
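For the segment-level task, a minimal sketch of what a dynamic-vision-guided auditory attention could look like is given below, assuming frame-difference features serve as the visual dynamic cue and standard dot-product attention is used to re-weight audio time steps. All names, dimensions, and the frame-difference heuristic are hypothetical illustrations rather than the proposed module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualGuidedAuditoryAttention(nn.Module):
    """Uses a visual dynamic feature (here, the difference between adjacent
    frame features) as the attention query over audio time steps, emphasizing
    audio content that resonates with the visual motion (illustrative sketch)."""

    def __init__(self, visual_dim=512, audio_dim=128, attn_dim=128):
        super().__init__()
        self.query_proj = nn.Linear(visual_dim, attn_dim)
        self.key_proj = nn.Linear(audio_dim, attn_dim)

    def forward(self, visual_seq, audio_seq):
        # visual_seq: (batch, T, visual_dim), audio_seq: (batch, T, audio_dim)
        # Simplified dynamic cue: current frame feature minus the previous one
        # (torch.roll wraps the first frame; a real module would handle boundaries).
        dynamic = visual_seq - torch.roll(visual_seq, shifts=1, dims=1)
        q = self.query_proj(dynamic)                                     # (batch, T, attn_dim)
        k = self.key_proj(audio_seq)                                     # (batch, T, attn_dim)
        scores = torch.bmm(q, k.transpose(1, 2)) / (k.size(-1) ** 0.5)   # (batch, T, T)
        weights = F.softmax(scores, dim=-1)
        distilled_audio = torch.bmm(weights, audio_seq)                  # (batch, T, audio_dim)
        return distilled_audio, weights


# Example with random segment-level features.
audio_out, attn = VisualGuidedAuditoryAttention()(torch.randn(2, 10, 512), torch.randn(2, 10, 128))
print(audio_out.shape, attn.shape)  # torch.Size([2, 10, 128]) torch.Size([2, 10, 10])
```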
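For the video-level task, one simple way to realize key frame filtering along the temporal dimension is to score each frame feature and keep only the top-k frames before classification. The sketch below is an assumption-based illustration; the scoring network, the fixed top-k selection, and all dimensions are hypothetical and not the paper's module.

```python
import torch
import torch.nn as nn


class KeyFrameFilter(nn.Module):
    """Scores each frame feature and keeps the top-k frames in temporal order,
    discarding background segments before recognition (illustrative sketch)."""

    def __init__(self, feat_dim=512, keep_k=8):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.keep_k = keep_k

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim)
        scores = self.scorer(frame_feats).squeeze(-1)                      # (batch, T)
        k = min(self.keep_k, frame_feats.size(1))
        topk = torch.topk(scores, k=k, dim=1).indices                      # (batch, k)
        topk, _ = torch.sort(topk, dim=1)                                  # preserve temporal order
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return torch.gather(frame_feats, dim=1, index=idx)                 # (batch, k, feat_dim)


# Example: 32 frame features reduced to the 8 highest-scoring frames.
kept = KeyFrameFilter()(torch.randn(2, 32, 512))
print(kept.shape)  # torch.Size([2, 8, 512])
```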