
Research On Human Action Recognition Based On Deep Learning

Posted on: 2020-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: W T Liu
GTID: 2428330620462251
Subject: Electronic Science and Technology
Abstract/Summary:
Video understanding has broad applications in human-computer interaction, video classification, and autonomous driving, and deep learning methods for intelligent video analysis have been receiving increasing attention. The success of neural networks on image tasks suggests an approach to video understanding problems, especially human action recognition. In real-world video, illumination, background, camera motion, and other conditions vary, so action recognition based on hand-crafted features is not robust; deep learning methods adapt better to the data. Among deep learning methods for human action recognition, the two-stream convolutional neural network feeds RGB images into a spatial-stream network and extracted optical flow into a temporal-stream network, and each stream extracts features and performs classification. This approach has three problems. First, the feature-extraction networks are shallow, and deeper network models overfit when trained on smaller datasets. Second, sampling a single RGB frame and a single optical flow stack from a video provides no long-term temporal modeling and ignores the correlation and temporal ordering of features from local time segments of the video. Third, training on action recognition datasets does not account for the effect of sample imbalance on the results. The main research work of this thesis is as follows:

(1) To address the shallow feature-extraction module of the spatio-temporal two-stream convolutional neural network, a deeper neural network is used to extract more effective features, and residual modules are introduced to prevent the degradation caused by excessive network depth. On this basis, a Spatio-temporal Two-stream Residual Network (STRN) method is proposed. Because the training data is limited and overfitting occurs easily, a residual network pre-trained on ImageNet is transferred to the human action recognition task to initialize the weights of the spatio-temporal two-stream residual network, which is then trained with a lower learning rate. Experiments show that features extracted by the deep residual network trained in this way achieve better results on the task. The STRN method achieves a recognition accuracy of 92.7% on the UCF101 dataset.

(2) To address the lack of temporal modeling in the STRN method, a Temporal Feature Fusion Spatio-temporal Two-stream Residual Network (TFF-STRN) is proposed. The video is divided into temporal segments and sampled per segment to obtain the input RGB images and optical flow stacks of the two-stream deep residual network. Each segment sample is fed into the deep residual network to obtain per-segment features. In the spatial stream, the per-segment appearance features are concatenated in temporal order and fed to a multilayer perceptron to learn the spatial classification feature; this adds temporal information to the spatial-stream segment features and makes the resulting feature more effective. In the temporal stream, the per-segment motion features output by the deep residual network are averaged to obtain the action classification feature. In addition, a classification loss function with an adjustment factor is introduced to reduce the contribution of easy-to-classify samples to the total loss, so that the model focuses on hard-to-classify samples; this addresses the fact that classification difficulty is not considered during training. The TFF-STRN method with temporal feature fusion reaches an accuracy of 94.1% on the UCF101 action recognition dataset.

(3) A short-video classification system for sports is designed and implemented. Through the system, the user records a clip or selects a sports short video that meets the requirements and uploads it to the back end. After the back end receives the video, the human action recognition algorithm classifies it, the video is automatically stored under the category of its sport type, and the user is notified of the result. The system also lets users view all user-uploaded sports videos by category and displays on the homepage the latest uploads in the categories the user follows. With this system, users can upload videos without sorting them manually and can easily search and browse short sports videos in the categories that interest them.
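The segment sampling and feature fusion described in item (2) can be illustrated with a minimal sketch. The abstract does not state the number of segments, the sampling position within a segment, or the feature dimensions, so the function names, the choice of the centre frame, and the plain-list features below are all assumptions made for illustration, not the thesis implementation.

```python
from typing import List, Sequence


def segment_indices(num_frames: int, num_segments: int) -> List[int]:
    """Split a video into equal segments and pick each segment's centre frame.

    Assumption: centre-frame sampling; the thesis may sample differently.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * k + seg_len / 2) for k in range(num_segments)]


def concat_in_time_order(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Spatial stream: join per-segment appearance features end to end.

    The result would be the input to a multilayer perceptron (not shown).
    """
    return [x for feat in segment_features for x in feat]


def average_over_segments(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Temporal stream: element-wise mean of per-segment motion features."""
    n = len(segment_features)
    return [sum(column) / n for column in zip(*segment_features)]
```

For example, `segment_indices(90, 3)` selects one frame from each third of a 90-frame clip. Concatenation keeps the order of the segments, which is what lets a later classifier use temporal information; averaging discards that order.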
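The abstract does not give the formula of the classification loss with an adjustment factor mentioned in item (2). One widely used loss of this kind is focal loss, which multiplies cross-entropy by (1 - p_t)^gamma, where p_t is the predicted probability of the true class. The sketch below assumes that form; the function names and the value gamma = 2.0 are hypothetical and may differ from what the thesis uses.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Standard cross-entropy for one sample."""
    return -math.log(softmax(logits)[target])


def modulated_loss(logits: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_t) ** gamma (focal-loss form, assumed).

    Confident, correct predictions (p_t near 1) are scaled towards zero,
    so hard samples dominate the total loss. gamma = 0 gives plain
    cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With a confidently correct sample such as `modulated_loss([5.0, 0.0, 0.0], 0)`, the loss shrinks by a much larger factor relative to cross-entropy than it does for a misclassified sample such as `modulated_loss([0.0, 1.0, 0.0], 0)`, which is the down-weighting behaviour the abstract describes.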
Keywords/Search Tags: optical flow extraction, feature extraction, residual network, multilayer perceptron, loss function