
Research On Action Recognition Based On Multimodal Feature Learning

Posted on: 2021-05-05  Degree: Master  Type: Thesis
Country: China  Candidate: M M Cui  Full Text: PDF
GTID: 2428330629452687  Subject: Computer application technology
Abstract/Summary:
Video-based action recognition is a hot topic in computer vision, with a wide range of applications in video surveillance, human-computer interaction, video information retrieval, and intelligent driving. With the explosive growth of video data on the Internet in recent years, achieving effective and intelligent understanding and analysis of video data has become critical. Traditional methods that rely only on hand-crafted feature extraction have many limitations and do not scale to massive video data, whereas deep learning methods, especially deep convolutional neural networks, have made great progress in this area.

The goal of action recognition research is to recognize and understand the actions of people in a video and output the corresponding labels. Beyond the spatial information present in a two-dimensional image, video data adds temporal information about how an action unfolds. Owing to the complexity of human behavior, viewpoint changes, background noise, and other factors, efficiently, accurately, and comprehensively extracting the spatiotemporal characteristics of actions and designing a reasonable, effective network structure remains a challenge.

To address these problems, this thesis designs a network based on multimodal feature learning for action recognition in video. The traditional two-stream method extracts spatial features from RGB images and temporal features from optical flow, but in this method the temporal information can only be extracted manually. To capture spatiotemporal features more fully, this thesis therefore adds an improved three-dimensional residual convolutional neural network to the two-stream framework: the spatial features learned by the 2D spatial stream, the temporal features learned by the 2D temporal stream, and the spatiotemporal features learned by the 3D network are combined by a weighted fusion of their category scores. Following the idea of modeling long-range temporal structure, sparse sampling is used to avoid a large amount of spatiotemporal redundancy. In the 3D residual convolutional neural network, each 3×3×3 convolution is decomposed into a 1×3×3 convolution and a 3×1×1 convolution, which is equivalent to adding one-dimensional temporal feature extraction on top of a two-dimensional spatial convolution; in addition, global average pooling is used in place of fully connected layers, effectively reducing the number of model parameters. Learning multimodal features in this way effectively improves the recognition performance of the model.

Experiments are performed on two commonly used datasets, HMDB-51 and UCF-101. The network is trained with methods such as data augmentation and cross-modality pre-training to reduce the risk of overfitting. The experimental results show that the proposed method effectively improves recognition accuracy and achieves good recognition results on both datasets.
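The weighted fusion of category scores from the three streams can be sketched as below. This is a minimal illustration of late score fusion; the stream weights and score values are assumptions for the example, not the values used in the thesis.

```python
from math import exp

def softmax(scores):
    """Convert raw per-class scores to probabilities."""
    m = max(scores)
    e = [exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]

def fuse_scores(spatial, temporal, three_d, weights=(1.0, 1.5, 1.0)):
    """Late fusion: weighted sum of the three streams' class probabilities.

    `spatial`, `temporal`, `three_d` are per-class score lists from the
    2D spatial stream, the 2D temporal stream, and the 3D network.
    Returns the index of the predicted action class.
    """
    probs = [softmax(s) for s in (spatial, temporal, three_d)]
    fused = [sum(w * p[i] for w, p in zip(weights, probs))
             for i in range(len(spatial))]
    return fused.index(max(fused))
```

Giving the optical-flow (temporal) stream a slightly higher weight is a common choice in two-stream work; in practice the weights are tuned on a validation set.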
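The sparse sampling used to avoid spatiotemporal redundancy can be illustrated with a segment-based scheme in the style of temporal segment networks: the video is split into a few equal segments and one frame is drawn from each, so a handful of frames covers the whole clip. The segment count here is an assumption for the sketch.

```python
import random

def sparse_sample(num_frames, num_segments=3, rng=None):
    """Split a clip of `num_frames` frames into `num_segments` equal
    segments and draw one random frame index from each segment.

    Dense frame-by-frame sampling is highly redundant; one frame per
    segment still spans the full temporal extent of the action.
    """
    rng = rng or random.Random()
    seg = num_frames // num_segments
    return [k * seg + rng.randrange(seg) for k in range(num_segments)]
```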
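The two parameter reductions described above (factorizing the 3×3×3 convolution into 1×3×3 + 3×1×1, and replacing fully connected layers with global average pooling) can be made concrete with a quick count. The channel widths, feature-map size, and class count below are illustrative assumptions, not the thesis's actual architecture.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution with kernel (kt, kh, kw), bias ignored."""
    return c_in * c_out * kt * kh * kw

# Factorizing one 3x3x3 convolution (64 -> 64 channels) into a 1x3x3
# spatial convolution followed by a 3x1x1 temporal convolution:
full = conv3d_params(64, 64, 3, 3, 3)
factored = conv3d_params(64, 64, 1, 3, 3) + conv3d_params(64, 64, 3, 1, 1)
# full = 110592 weights, factored = 49152: the 1D temporal kernel is
# added on top of the 2D spatial kernel at far lower cost.

# Replacing a fully connected layer over a flattened 4x7x7x512 feature
# map with global average pooling before a 101-class classifier:
fc_params = 4 * 7 * 7 * 512 * 101   # flatten, then fully connected
gap_params = 512 * 101              # pool to one value per channel first
```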
Keywords/Search Tags:Action recognition, Deep learning, Multimodal features, Convolutional neural network