| Recognition of human actions in video has long been a focus of research in machine learning. It has wide application in daily life, including autonomous driving, smart homes, game interaction, video review, safety, and sports training, where it has a positive impact. However, because video data is complex, video action recognition is a particularly challenging problem. Varying viewing conditions and viewing angles, noisy content irrelevant to the subject of the action, and the complex changes and temporal structure of some actions all make it harder for a network model to learn. This dissertation addresses three problems: action recognition in video is easily affected by subject saliency, it is easily affected by class similarity, and dataset construction is poorly automated. The following research results have been achieved: 1. An attention branch and a micro attention branch structure for deep 3D convolutional networks are proposed to address the drop in accuracy that occurs when a network recognizes actions from video frames in which the action subject is not salient. The attention branch consists of a 1×1×1 convolution kernel and a max pooling layer. It can be added to an existing 3D convolutional network as a plug-in, leaving the overall architecture of the original network unchanged. A network with attention branches fuses the attention features they extract in real time during feature extraction, which improves the network's ability to focus on the action subject. Unlike the attention branch, the micro attention branch is mainly used in network modules with multiple sub-branches, where it improves the directional fusion of attention features. Experimental results show that the recognition accuracy of the network built with micro attention branches is 3.6% higher than that of the original network, while 
the number of parameters to be trained increases by only 0.6%. 2. A two-stream neural network that uses sound to assist action recognition is proposed. It mainly addresses the problem that a network taking video frames as input cannot make an accurate judgment when, for a short time, the action subject does not appear in the video picture or in a salient position of the frame. First, sound texture features are obtained by computing statistics of the video's audio, imitating how the human brain processes external sound, and a network that takes the sound texture as input is designed. Then a two-stream neural network is constructed by combining this network with the I3D network, which takes images as input. The predictions of the two branch networks are combined by average fusion. Finally, networks with different input features are trained and validated on the Kinetics dataset. Experimental results show that the recognition accuracy of the two-stream model is 7.6% higher than that of the network using video frames alone. This shows that sound can serve as an important basis for action recognition. 3. A video action classifier based on text information is proposed, which reduces the manual effort of constructing video datasets by labeling the content of video caption text. First, a speech action classifier based on the BERT model is designed, which infers the corresponding action category from its understanding of the text. Then an action speech dataset for training the classifier is constructed from 1000 collected movie scripts, and a classifier that can recognize special actions is trained. Finally, the effectiveness of the classifier is demonstrated by verifying the effectiveness of the sMovie dataset. In the experiment, the same network is trained on the sMovie dataset and on the Kinetics dataset, both limited to the same size, and then transferred to the UCF101 
dataset for verification. Experimental results show that the average accuracy of the model trained on the sMovie dataset is only 5.4% lower than that of the model trained on the Kinetics dataset. This indicates that the sMovie dataset contains a large number of effective video clips, which can be used to assist network design and training. 4. A multi-stream network fusion method based on the intrinsic relationships between classes is designed to mitigate the impact of similar actions on classification results. The fusion method computes a confusion matrix between action categories from the prediction scores that each network stream assigns to each category, and thereby obtains the similarity relationships between action categories in the different information streams. In addition, building on the preceding results of this dissertation, a multi-stream neural network is designed. It consists of four independent networks whose branches use the spatio-temporal, action, audio, and text information streams provided by the video, making full use of its rich multimodal information. In the experiment, the multi-stream network using the class-relationship-based fusion method is 4.6% more accurate than the same network using the independent fusion method. |
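
The average fusion used to combine the two branch networks in result 2 can be illustrated with a minimal sketch. It assumes each stream outputs one score per action class (for example, softmax probabilities) and that the streams are weighted equally; the function names and the example scores are hypothetical and are not taken from the dissertation.

```python
from typing import List, Sequence


def average_fusion(stream_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class scores of several streams (late fusion).

    Assumption: every stream scores the same classes in the same order.
    """
    if not stream_scores:
        raise ValueError("at least one stream is required")
    num_classes = len(stream_scores[0])
    if any(len(scores) != num_classes for scores in stream_scores):
        raise ValueError("all streams must score the same classes")
    num_streams = len(stream_scores)
    return [
        sum(scores[c] for scores in stream_scores) / num_streams
        for c in range(num_classes)
    ]


def predict(stream_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest fused score."""
    fused_scores = average_fusion(stream_scores)
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Hypothetical scores for three action classes from two streams.
rgb_scores = [0.50, 0.40, 0.10]    # image stream alone favours class 0
audio_scores = [0.10, 0.80, 0.10]  # audio stream strongly favours class 1

fused = average_fusion([rgb_scores, audio_scores])
predicted_class = predict([rgb_scores, audio_scores])
```

In this made-up example the image stream alone would choose class 0, while the fused scores choose class 1, which is the kind of correction an audio stream can contribute when the action subject is briefly out of view. The class-relationship-based fusion in result 4 replaces this equal weighting; its exact formulation is not given in the abstract and is not reproduced here.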