Since the State Council officially released the "Development Plan for the New Generation of Artificial Intelligence", artificial intelligence technology in China has entered a new stage of development. With strong government support, new artificial intelligence technologies and industries such as intelligent security, intelligent medicine, intelligent education, self-driving and intelligent manufacturing are developing rapidly and have brought great convenience to people's lives. The 14th Five-Year Plan proposes to "Accelerate Digital Development and Build a Digital China", emphasizing the need to activate the potential of data elements, drive the digital transformation of traditional industries, and create a new development concept for the digital economy. People are the main body of social life. Building intelligent algorithm models that analyze and understand human activities is a key step toward inventing various intelligent applications, developing the digital economy and building an intelligent social ecosystem.

Human activity recognition and understanding technologies are research hotspots in the field of computer vision. They are the basis of many image and video analysis tasks and have important theoretical research significance. Against the background of constructing "Digital China", their practical significance is becoming more and more prominent. Extracting specific human activity patterns from various images and videos and understanding them is the core task of computer vision-based human activity recognition and understanding methods. Currently, with the development of computer vision and deep learning technology, researchers have proposed a large number of human activity understanding algorithms based on the Convolutional Neural Network (CNN), Graph Convolutional Network (GCN) and Recurrent Neural Network (RNN). Because of their flexible architecture designs and good performance, these methods have been widely
recognized and applied. Meanwhile, because visual sensors of different types are deployed in various working environments, images and videos are usually recorded with complex changes in background, illumination and viewpoint, so extracting and understanding human activity characteristics in real scenes remains an important challenge. Therefore, based on computer vision and deep learning technology and aiming at the problem of human activity understanding in complex real scenes, this dissertation concentrates on studying and constructing neural network architectures with high accuracy, robustness and practicality. The main works of this dissertation are summarized as follows:

Aiming at the problems of low accuracy, high computational complexity and weak interference immunity of existing human activity recognition methods, this dissertation constructs the Two-Stream Residual Spatial-Temporal Attention Network (2S-RSTAN). The two-stream network architecture can extract spatiotemporal features from video and has been widely used in human activity understanding tasks. However, videos contain a large amount of redundant information in both the temporal and spatial dimensions, which increases the difficulty of network learning and limits network performance. To solve this problem, a sparse sampling strategy is first adopted to reduce the number of video frames. Residual learning and a spatiotemporal attention mechanism are then integrated into the two-stream architecture to learn human activity features in videos, so that the network can focus more on meaningful spatiotemporal features. In the 2S-RSTAN, RGB images and RGB difference images are adopted as the inputs of the two streams, respectively. Each stream is composed of residual spatial-temporal attention modules, which enable the network to generate attention-aware features in the temporal and spatial dimensions and largely reduce the negative interference caused by the
redundant information. Combined with the inherent characteristics of residual learning, a sufficiently deep network can be built to fully learn the spatiotemporal information in videos. As the layers go deeper, the residual spatial-temporal attention blocks can adaptively generate attention-aware features at different depths. The experimental results show that the proposed network architecture achieves good performance in human activity recognition. Moreover, compared with 3D networks, the network is more lightweight and can meet the requirements of real scenarios.

Existing human activity recognition and understanding networks suffer from poor interpretability and low robustness, and their model learning is blind. To address these problems, this dissertation constructs the Body Part Relation Reasoning Network (BPRRN). Body part features play an important role in human activities, and there are certain correlations between different body parts that are significant for reasoning about human activities. In view of this, BPRRN is constructed from the perspective of relation reasoning. The human body is divided into ten parts so that the network focuses on learning the characteristics of these regions. Meanwhile, a body part relation reasoning module is constructed to explore the potential relations between different body parts under different activities. The obtained relation information is then concatenated with the whole-body features and scene features to infer the human activity in the image. In the time domain, a temporal relation reasoning module is constructed to model and explore the temporal relation features between adjacent video frames and to infer the human activity at the video level. The experimental results show that the proposed body part relation reasoning module and temporal relation reasoning module improve the human activity understanding ability of the network. In addition, it can be seen from the visualized
experimental results that the local activity characteristics provide good interpretability for the reasoning results of the whole body.

Existing human activity recognition and understanding networks passively perceive the action patterns in images and videos during training. They lack the active cognitive ability to reason and judge subjectively with prior knowledge. In view of this, this dissertation constructs the Human Activity Knowledge Transfer Network (HAKTN). When training data are insufficient, existing neural networks have difficulty obtaining enough valuable information, which results in poor performance. Human beings, however, can use their prior knowledge to understand the information contained in a scene through subjective judgment. Inspired by this, HAKTN integrates prior knowledge into the training process so as to equip the network with human-like cognitive and perceptual abilities. By studying the co-occurrence relationship between body part actions and objects in images, a co-occurrence probability knowledge matrix is constructed and then used to guide the network in extracting part action features. According to the physiological structure of the human body, a human skeleton knowledge matrix is constructed, and on this basis the part action features are modeled to infer human activities. Compared with existing deep learning-based human activity understanding methods, the proposed network unifies subjective cognitive ability and passive cognitive ability. Numerical experiments demonstrate the effectiveness of this method, and the experimental results show that the proposed network works well on both normal data and small-sample data, which further demonstrates the high robustness of the network.
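
The co-occurrence probability knowledge matrix mentioned above can be illustrated with a minimal sketch. The abstract does not specify how the matrix is estimated, so the following assumes one plausible reading: each entry is the fraction of images containing a given object in which a given body part action also appears. The function name, the label names and the toy annotations are all hypothetical and are not taken from the dissertation.

```python
from collections import Counter


def cooccurrence_matrix(annotations, objects, part_actions):
    """Estimate P(part action | object) from per-image annotations.

    annotations: iterable of (objects_in_image, part_actions_in_image) pairs.
    Returns one row per object and one column per part action.
    This estimator is an assumption made for illustration only.
    """
    object_counts = Counter()  # number of images containing each object
    pair_counts = Counter()    # number of images containing object and action
    for image_objects, image_actions in annotations:
        for obj in set(image_objects):
            object_counts[obj] += 1
            for action in set(image_actions):
                pair_counts[(obj, action)] += 1

    matrix = []
    for obj in objects:
        total = object_counts[obj]
        # An object that never appears gets an all-zero row.
        row = [pair_counts[(obj, action)] / total if total else 0.0
               for action in part_actions]
        matrix.append(row)
    return matrix


# Toy annotations, invented purely for illustration.
EXAMPLE_ANNOTATIONS = [
    ({"bicycle"}, {"hand_hold", "foot_pedal"}),
    ({"bicycle"}, {"hand_hold"}),
    ({"cup"}, {"hand_hold", "head_drink"}),
    ({"cup", "bicycle"}, {"hand_hold"}),
]
OBJECTS = ["bicycle", "cup"]
PART_ACTIONS = ["hand_hold", "foot_pedal", "head_drink"]

if __name__ == "__main__":
    rows = cooccurrence_matrix(EXAMPLE_ANNOTATIONS, OBJECTS, PART_ACTIONS)
    for obj, row in zip(OBJECTS, rows):
        print(obj, [round(p, 2) for p in row])
```

In a setup like this, a row of the matrix could serve as a prior over part actions once an object has been detected in the image. How HAKTN actually injects the matrix into training is not described in the abstract, so that step is left out here.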