Action understanding is an important research direction in intelligent video surveillance, with broad application prospects in fields such as intelligent security systems, human-computer interaction, and intelligent sports analysis. Traditional action understanding relies mainly on manual viewing of videos, with a low level of automation, slow processing, and high cost, which falls far short of the needs of large-scale video surveillance. Meanwhile, the development of deep learning and artificial intelligence has made it possible for computers to automatically process the actions that occur in videos. Action understanding technologies for intelligent video surveillance systems have therefore become a research hotspot. This dissertation focuses on action understanding in intelligent video surveillance systems and conducts research from three aspects: abnormal behavior detection, weakly supervised temporal action detection, and fine-grained action recognition in videos. It studies in depth the problems that existing methods face in effectively expressing normal-behavior features, modeling temporal information in video, and understanding fine-grained action structure. The main contributions are as follows:

(1) To address the extreme imbalance between normal and abnormal data in training samples, we design a video anomaly detection approach based on a dual-stream conditional generative adversarial network that requires no abnormal training videos, exploiting the fact that normal data are clearly labeled and predictable. The method detects abnormal targets accurately by learning only the future behavior patterns of normal events; the dual-stream structure detects motion and shape anomalies in videos from the motion and appearance perspectives, respectively (see the first sketch below). Experiments verify the effectiveness of the proposed approach.

(2) To address the problem of locating the transition state between background and foreground in weakly supervised temporal action detection, we propose a two-stream graph convolutional network fusion approach that combines the semantic information and the temporal relationships of video segments, exploiting the relationship-modeling strength of graph convolutional networks. A transition-aware temporal correlation graph is designed to prevent the transition state between action and background from being classified as an action class. Meanwhile, the pairwise feature similarity of all video segments is computed to build a semantic similarity weighted graph, which further separates action segments from background segments in the feature space (see the second sketch below). Experiments verify the effectiveness of both the proposed method and the transition-aware temporal correlation graph.
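To make contribution (1) concrete, the following is a minimal, hypothetical PyTorch sketch of dual-stream future-frame prediction for anomaly detection. The layer sizes, the use of frame differences as a motion proxy, and the loss terms are illustrative assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class StreamGenerator(nn.Module):
    """Predicts the next frame (or next motion map) from a stack of past inputs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Judges whether a predicted frame looks like a real (normal) frame."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # patch scores -> per-sample scalar

# Appearance stream: 4 past RGB frames -> next RGB frame.
# Motion stream: 3 frame-difference maps -> next difference map (assumed flow proxy).
app_gen = StreamGenerator(in_ch=12, out_ch=3)
mot_gen = StreamGenerator(in_ch=9, out_ch=3)
disc = Discriminator(in_ch=3)

frames = torch.randn(2, 5, 3, 64, 64)            # (batch, time, C, H, W)
past = frames[:, :4].flatten(1, 2)               # condition on 4 past frames
diffs = frames[:, 1:] - frames[:, :-1]           # crude motion cue
pred_frame = app_gen(past)
pred_motion = mot_gen(diffs[:, :3].flatten(1, 2))

# Adversarial term (assumed) pushing predictions toward realistic normal frames.
adv = -disc(pred_frame).mean()

# At test time, prediction error serves as the anomaly score: only normal
# events were seen in training, so poorly predicted futures signal anomalies.
score = ((pred_frame - frames[:, 4]) ** 2).mean() + \
        ((pred_motion - diffs[:, 3]) ** 2).mean()
print(score.item(), adv.item())
```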
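For contribution (2), here is a hedged sketch of fusing a temporal-neighborhood graph with a semantic similarity graph over segment features. The one-hop temporal adjacency, the cosine-similarity weighting, and the top-k video-level pooling are assumptions for illustration, not the dissertation's exact graphs.

```python
import torch
import torch.nn.functional as F

T, D, C = 20, 128, 10            # segments, feature dim, action classes
x = torch.randn(T, D)            # per-segment features (e.g., from a 3D CNN)

# Temporal correlation graph: each segment connects to its neighbours, so a
# transition segment is influenced by both its action and background sides.
idx = torch.arange(T)
A_temp = ((idx[:, None] - idx[None, :]).abs() <= 1).float()

# Semantic similarity weighted graph: pairwise cosine similarity in feature
# space helps separate action-like from background-like segments.
xn = F.normalize(x, dim=1)
A_sem = (xn @ xn.t()).clamp(min=0)

def gcn(A, x, w):
    """One graph-convolution step with symmetric normalisation."""
    d = A.sum(1)
    A_hat = A / torch.sqrt(d[:, None] * d[None, :])
    return F.relu(A_hat @ x @ w)

w1, w2 = torch.randn(D, D), torch.randn(D, D)
fused = gcn(A_temp, x, w1) + gcn(A_sem, x, w2)     # two-stream fusion
cas = fused @ torch.randn(D, C)                    # class activation sequence
video_score = cas.topk(k=4, dim=0).values.mean(0)  # weak, video-level pooling
print(video_score.shape)                           # torch.Size([10])
```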
(3) To solve the missed temporal localizations caused by fixed-size video segments, we propose a mask attention-guided graph convolution layer in which local and global views cooperate for temporal action detection. Built on a two-branch network, the method combines the overall action distribution of the video with its temporal context. Mask attention guides the separation of background segments and action segments in the feature space; the local view enhances temporal-context features with segment-level temporal information, while the global view extracts a global region of interest for the current segment, guided by the distribution of the video's overall actions (see the third sketch below). Experiments verify that this method effectively improves detection accuracy.

(4) Given that fine-grained actions are similar in both appearance and motion pattern, we propose a fine-grained action recognition approach that joins a Transformer with a graph convolutional network, combining the Transformer's cross-attention mechanism with the feature-aggregation strength of the graph convolutional network. By learning a set of query vectors for actions and their attributes, the method recasts the task of classifying similar actions as that of answering queries about their dissimilar attributes. The graph convolutional network is designed to compensate for cross-attention's neglect of the correlations between video segments when the query vectors are learned (see the final sketch below). Experiments demonstrate the effectiveness of the proposed approach.
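For contribution (3), the following is an illustrative sketch of the two-branch local/global idea with mask attention. The attention form, the stand-in actionness head, the median threshold, and the fusion by addition are assumptions rather than the dissertation's exact design.

```python
import torch
import torch.nn.functional as F

T, D = 20, 128
x = torch.randn(T, D)                        # segment features

# Local view: enhance each segment with its immediate temporal context.
local = F.conv1d(x.t().unsqueeze(0),         # (1, D, T)
                 weight=torch.randn(D, D, 3), padding=1).squeeze(0).t()

# Foreground mask from an actionness head (random stand-in here): the video's
# overall action distribution guides which segments may be attended globally.
actionness = torch.sigmoid(torch.randn(T))
mask = (actionness > actionness.median()).float()

# Global view: attention over all segments, with background keys masked out so
# foreground and background separate in the feature space.
attn = (x @ x.t()) / D ** 0.5
attn = attn.masked_fill(mask[None, :] == 0, float('-inf'))
attn = torch.softmax(attn, dim=1)
global_view = attn @ x

fused = local + global_view                  # cooperation of the two views
print(fused.shape)                           # torch.Size([20, 128])
```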
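For contribution (4), here is a minimal sketch of learned action/attribute queries cross-attending to graph-aggregated segment features. All dimensions, the similarity-graph construction, and the single attribute-scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, Q = 16, 128, 8                 # segments, feature dim, query count
segs = torch.randn(1, T, D)          # per-segment features for one video

# GCN pass: re-aggregate segments by feature similarity, compensating for
# cross-attention treating them as an unordered set.
xn = F.normalize(segs[0], dim=1)
A = torch.softmax(xn @ xn.t(), dim=1)
segs = (A @ segs[0]).unsqueeze(0)

# A set of learnable query vectors, one per action/attribute "question".
queries = nn.Parameter(torch.randn(1, Q, D))
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
answers, _ = cross_attn(queries, segs, segs)   # queries read from the video

# Each answered query is scored against its attribute, so classifying similar
# actions reduces to answering which attributes are present.
attr_logits = answers.squeeze(0) @ torch.randn(D, 1)
print(attr_logits.shape)             # torch.Size([8, 1])
```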