The explosive growth of video data on network platforms has promoted wide applications of video understanding. Video data contains not only static images but also the temporal correlations among those images. Accurate recognition of video actions requires effective extraction and fusion of this information. However, existing video action recognition methods still have deficiencies in extracting and fusing temporal and spatial information into full spatio-temporal information, in efficiently modeling long-range dependencies on spatio-temporal features, and in using short-term motion information to enhance the spatio-temporal modeling ability of the model. These deficiencies greatly affect the recognition accuracy of video actions. In view of the above issues in video action recognition technology and its application to driving behavior recognition, the following research has been carried out in this thesis.

Firstly, to process static appearance information and dynamic motion information differently and thereby enhance the feature extraction ability, a static and dynamic shunt aggregation module (SDSAM) is proposed in this thesis. By introducing channel attention, SDSAM processes the features that pass through a temporal convolution to obtain a channel-level score of the richness of temporal information. It then takes this richness score as a weight and aggregates the original feature with the temporally convolved feature in the form of a weighted sum. In SDSAM, the selection between features rich in motion information and features rich in spatial information is based on the richness score of the feature itself. The selection is therefore learnable, which not only avoids human intervention but also allows different responses at different locations of the network. This improves the model's ability to extract temporal information.

Secondly, to overcome the difficulty that convolutional neural networks (CNNs) have in capturing long-range dependencies
and to improve the ability to model long-range dependencies of spatio-temporal features, a remote spatial and temporal correlation enhancement module (RSTCEM) is proposed. RSTCEM averages the feature along its height and width, respectively. Next, it uses a temporal convolution to transform each spatial representation into a spatio-temporal representation containing temporal information. A correlation graph is then obtained through matrix multiplication, which encodes the spatio-temporal relevance of each location in the feature space. According to this correlation graph, the regions of the original feature with strong spatio-temporal correlation are enhanced. RSTCEM can therefore improve the model's ability to perceive the correlations between interacting objects in a video. Because the computation of the spatio-temporal correlation graph does not depend on the pixels at every spatial location, RSTCEM is lighter than some long-range dependency capture methods based on self-attention.

Finally, to improve the ability to extract short-term motion information, a motion space enhancement module (MSEM) is proposed in this thesis as an improvement of the motion excitation (ME) module. MSEM obtains a motion representation from the original features. A spatial attention mechanism then processes the motion representation to enhance the parts that are useful for extracting short-term motion information. In the channel dimension, the motion representation is used to enhance the short-term motion features of the original channels. Owing to the introduction of the spatial attention mechanism, MSEM strengthens the influence of the useful parts of the motion representation and reduces the interference of useless background and of drift caused by relative motion on motion information extraction. As a result, it greatly improves the recognition accuracy of the model.

By combining the above three modules, video action recognition models are designed in this thesis and tested on the
Something-Something V1 dataset, achieving Top-1 accuracies of 48.1%, 48.4%, and 49.1%, respectively. The experimental results show that, compared with some existing models, the proposed video action recognition models perform better. Furthermore, a model with a relatively small number of parameters and low computational cost is trained with the help of knowledge distillation for driving behavior recognition in this thesis. The amount of computation is reduced by half, while the accuracy of the model decreases by only 0.2%. The experimental results show that the designed model can meet the requirements of driving behavior recognition.
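The SDSAM idea described above, a channel-level temporal-richness score used as the weight in a weighted sum of the temporally convolved and original features, can be illustrated with a minimal NumPy sketch. The function name `sdsam`, the pooled (C, T) feature shape, the shared 1-D temporal kernel, the projection `w_fc`, and the sigmoid scoring are all assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sdsam(x, temporal_kernel, w_fc):
    """Illustrative sketch of static/dynamic shunt aggregation.

    x: (C, T) per-channel features over time (spatial dims pooled for brevity).
    temporal_kernel: 1-D temporal convolution kernel (assumed shared across channels).
    w_fc: (C, C) projection for the channel attention (an assumption).
    """
    # temporal-convolution branch: emphasizes dynamic motion information
    xt = np.stack([np.convolve(x[c], temporal_kernel, mode="same")
                   for c in range(x.shape[0])])
    # channel attention: pool the convolved features over time and score each
    # channel's temporal-information richness in (0, 1)
    score = sigmoid(w_fc @ xt.mean(axis=1))               # (C,)
    # weighted sum: richness-weighted dynamic branch plus the static original
    return score[:, None] * xt + (1.0 - score)[:, None] * x
```

Because the score is produced by learnable parameters (`w_fc` here stands in for them), the static/dynamic balance adapts per channel and per network location rather than being hand-tuned, matching the "learnable, avoids human intervention" property claimed above.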
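The RSTCEM pipeline, average along height and width, apply a temporal convolution to each descriptor, multiply the results into a correlation graph, and use that graph to enhance correlated regions, can likewise be sketched. The single-channel (T, H, W) shape, the function name `rstcem`, the sigmoid gating, and the residual-style enhancement are assumptions; only the sequence of operations follows the description above.

```python
import numpy as np

def rstcem(x, temporal_kernel):
    """Illustrative sketch of remote spatio-temporal correlation enhancement.

    x: (T, H, W) single-channel feature map (channels omitted for brevity).
    """
    T, H, W = x.shape
    h_desc = x.mean(axis=2)                       # (T, H): averaged over width
    w_desc = x.mean(axis=1)                       # (T, W): averaged over height
    # temporal convolution turns each spatial descriptor into a
    # spatio-temporal descriptor containing temporal information
    tconv = lambda a: np.apply_along_axis(
        lambda v: np.convolve(v, temporal_kernel, mode="same"), 0, a)
    h_st, w_st = tconv(h_desc), tconv(w_desc)
    # correlation graph via matrix multiplication: (H, T) @ (T, W) -> (H, W),
    # one relevance value per spatial location, without per-pixel attention
    corr = (h_st.T @ w_st) / T
    gate = 1.0 / (1.0 + np.exp(-corr))            # squash to (0, 1)
    # enhance regions of the original feature with strong correlation
    return x * (1.0 + gate[None, :, :])
```

Note that the correlation graph costs O(T·H·W) to build from the pooled descriptors, rather than the O((H·W)²) pairwise map of pixel-level self-attention, which is the sense in which RSTCEM is lighter.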
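The MSEM description, derive a motion representation from the features, apply spatial attention to it, then use it for channel-level enhancement, can be sketched as follows. The adjacent-frame difference as the motion representation (as in the ME module MSEM builds on) is taken from the context; the channel-mean spatial attention and the sigmoid excitation are illustrative assumptions.

```python
import numpy as np

def msem(x):
    """Illustrative sketch of motion space enhancement.

    x: (C, T, H, W) feature tensor.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # short-term motion representation: adjacent-frame differences,
    # as in the motion excitation (ME) module this improves on
    motion = np.zeros_like(x)
    motion[:, :-1] = x[:, 1:] - x[:, :-1]
    # spatial attention over the motion representation: suppress useless
    # background and relative-motion drift, keep spatially useful regions
    spatial = sigmoid(motion.mean(axis=0, keepdims=True))   # (1, T, H, W)
    motion = motion * spatial
    # channel-level excitation: use the attended motion representation to
    # enhance short-term motion features of the original channels
    scale = sigmoid(motion.mean(axis=(1, 2, 3)))            # (C,)
    return x + x * scale[:, None, None, None]
```

The spatial attention step is what distinguishes this sketch from plain motion excitation: the channel-wise scale is computed only from the spatially attended motion map, so background regions contribute less to the enhancement.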