
Multi-branch Deep 3-Dimensional Convolutional Neural Network For Human Action Recognition In Videos

Posted on: 2021-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y K Huang
Full Text: PDF
GTID: 1488306464959169
Subject: Instrument Science and Technology
Abstract/Summary:
In the modern information society, video equipment and software keep improving while the cost of transmitting information keeps falling. Video data is therefore produced in massive quantities and has become one of the main carriers of information. Consequently, developing intelligent video analysis algorithms for human action recognition is one of the most important topics in computer vision research. Recently, supervised deep neural networks driven by big data have shown clear advantages in performance and application prospects over traditional hand-crafted feature design. The deep 3D convolutional neural network (3D CNN) is an efficient structure for extracting spatial-temporal features from video data, which makes it an important research direction for human action recognition; applying it economically to video processing remains one of the problems to be solved. This dissertation presents a series of studies and experiments on the transfer-based construction of deep 3D CNNs and the design of video analysis architectures. They cover the construction of 3D convolutional networks and the simplification of the parameter transfer process, an architecture for feature extraction at multiple temporal points, the reduction of redundancy in low-level action data, the aggregation of high-level action features, and the complementary enhancement of multi-model analysis. Together they enable 3D CNNs to reach state-of-the-art performance even without pre-training on video data. The main findings can be summarized as follows:

(1) To simplify the construction of 3D convolutional networks and the transfer of parameters, a 2D-Inflated construction method for 3D convolutional networks is proposed. It replaces the computationally expensive transfer approach, which requires pre-training a 3D CNN on large-scale video data, with the efficient construction of spatially homogeneous and temporally heterogeneous 3D convolutional networks and the efficient transfer of 2D convolutional network parameters, thereby reducing the training time and computational resources needed to build a 3D CNN. Theoretically, the 2D-Inflated method treats the transfer of 2D parameters as a domain adaptation process rather than an implicit pre-training process. It also views a 3D CNN as an ordered container holding multiple effective 2D CNNs, so that the 3D CNN has its own adaptive analysis models at the spatial-temporal level. Compared with previous inflated methods, it not only simplifies the construction of 3D convolutional networks but also extends the scope of application.

(2) A multi-branch deep 3D convolutional neural network architecture for video analysis is designed. The architecture extracts high-level spatial-temporal features at multiple temporal points of the video to obtain action semantics from different video segments. It then aggregates the branch-level action features through the proposed residual fully connected layer to analyze the overall action information and long-term content of the video. To find the optimal architecture, the influence of the number of branches and of the video segment data on recognition performance is explored: five branch designs are compared experimentally under temporal inputs of different capacities. The experiments show that, to improve recognition performance, the number of branches should vary inversely with the temporal input capacity of each segment, and a 6-branch architecture is identified as the optimal video analysis mode. Ablation experiments on the multi-branch feature aggregation mode further show that the proposed residual fully connected network effectively reduces gradient fragmentation and aggregates high-level action representations well, which improves recognition performance.

(3) For complementary fusion, a fusion pattern is proposed that combines the multi-branch architecture with the improved dense trajectories method. Improved dense trajectories is a traditional hand-crafted method that extracts feature descriptors in local spatial-temporal regions, whereas the deep feature representation is built through end-to-end supervised learning. The complementary fusion method takes advantage of both, and a series of fusion strategies is explored. The experimental results show that complementary fusion outperforms model ensembling. The 2D-Inflated method and the multi-branch architecture are also applied to densely connected networks, which demonstrates the generality of the proposed method for constructing 3D convolutional networks. In addition, multi-model ensembling and complementary fusion are combined to obtain advanced action recognition results.

(4) A multi-stream framework for multi-branch 3D convolutional neural networks is proposed. The proposed 2D-Inflated method and multi-branch architecture are applied in the optical flow image domain to train a deep 3D convolutional network suited to optical flow images. The resulting multi-stream, multi-branch 3D convolutional neural network further improves human action recognition. The final ensemble model achieves 95.8% recognition accuracy on the UCF101 dataset and 75.2% on the HMDB51 dataset. These results show that a deep 3D convolutional neural network can reach state-of-the-art performance without the expensive video data pre-training process.
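For readers unfamiliar with transferring 2D parameters into a 3D network, the sketch below illustrates the generic kernel-inflation idea behind earlier inflated methods: a 2D convolution kernel is copied along a new temporal axis and rescaled. It is not the 2D-Inflated method of this dissertation, whose construction details are not given in the abstract. The function name `inflate_2d_kernel`, the weight layout and the example shapes are assumptions made for illustration only.

```python
import numpy as np


def inflate_2d_kernel(w2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Copy a 2D conv kernel along a new temporal axis (generic inflation).

    Assumed layout (hypothetical, for illustration):
      w2d: (out_channels, in_channels, kH, kW)
      result: (out_channels, in_channels, time_depth, kH, kW)

    Dividing by time_depth means the temporal slices sum back to the
    original 2D kernel, so a clip made of identical frames produces the
    same response as the 2D kernel does on a single frame.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4-D kernel (out, in, kH, kW)")
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    # Insert a temporal axis at position 2 and repeat the kernel along it.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    return w3d / time_depth


# Example: inflate a bank of 3x3 kernels to 5x3x3 kernels.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))
w3d = inflate_2d_kernel(w2d, time_depth=5)
```

Because every temporal slice here is an identical copy, this baseline is temporally homogeneous; the abstract describes the proposed networks as temporally heterogeneous, so the actual method presumably differs on exactly this point.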
Keywords/Search Tags:Deep 3D convolutional neural network, Human action recognition, 2D-Inflated, Multi-Branch architecture, Multi-Stream network