| With the development of computer technology, artificial intelligence plays an increasingly important role in daily life. As an important branch of artificial intelligence, computer vision aims to make machines respond to things as humans do. In the current era of explosive video growth, a wide variety of platforms and devices produce large amounts of video data around the clock, and this data cannot be processed by humans alone. Intelligent algorithms for video understanding are therefore needed to improve efficiency. Humans are the main subject of video content and thus the crux of video analysis, so accurate analysis of human actions is the key to video analysis and understanding. Real applications require not only the category of each action in a video but also the start and end points of each action instance, which has made video action recognition and localization a hot topic in this field. Since depth information can capture the context of different parts of the human body and present more action details, this paper mainly studies action recognition with RGBD data as input. In addition, weakly supervised action localization, which does not require detailed annotation of action boundaries, can greatly reduce annotation cost and has greater practical value; it is therefore the second research topic of this paper. Although prior work has proposed many methods for these two tasks, many challenges and difficulties remain to be resolved. This paper studies these problems, and the main work is as follows: (1) First, most RGBD action recognition methods ignore the extraction of temporal features from depth information and do not extract and fuse the relationships between different modalities in an end-to-end way. To address these problems, this paper proposes a novel two-stream network with a 3D common-specific framework for RGBD action recognition. In this
approach, the TSN network serves as the base model, and RGB video frames and depth video frames are used as inputs. Four 3D convolutional blocks with non-shared parameters extract common-specific features. Finally, a similarity loss, a dissimilarity loss, and a classification loss are combined to jointly optimize the network parameters. Extensive experiments on three RGBD action recognition datasets demonstrate the effectiveness of our approach. (2) Second, to address the two challenges of integrity and separability in weakly supervised action localization, we study the basic idea of feature erasing and propose a novel deep snippet selective network for weakly supervised action localization. The approach has four branches. An attention branch generates class-agnostic attention scores to improve classification accuracy. Two erasing branches provide the network with prior knowledge of the background while erasing the most discriminative features. A background suppression branch further suppresses the activation of background features. Through the cooperation of these branches, our method achieves both integrity and separability in weakly supervised action localization. Extensive experiments on two widely used datasets demonstrate the effectiveness of the proposed method. |