
Research On Video Classification And Detection With Deep Learning

Posted on: 2020-10-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Yang
Full Text: PDF
GTID: 1488306548991769
Subject: Computer Science and Technology
Abstract/Summary:
Video is an important source of information for humans. In the information age, data grows rapidly: with the popularity of video capture devices, people shoot and upload videos daily, and an enormous number of videos have accumulated on the Internet and on social media. According to statistics, 300 hours of video are uploaded to a well-known video website every minute. Such a large volume of video content brings tremendous opportunities, but also a very big challenge: with this much video, it is almost impossible to analyze and filter it manually, one item at a time. We therefore urgently need intelligent systems that can automatically understand and analyze video content.

A key technical challenge in understanding video content is understanding the objects and actions it contains. For example, a paid video site needs to know which people and sports activities appear in each sports competition video in order to give users more accurate recommendations; video editors need to automatically extract relevant segments from a video based on keywords describing the desired content; and self-driving cars need to know the locations and actions of other vehicles and pedestrians within the safe driving range in order to act appropriately. To understand the objects and actions in video, this dissertation optimizes the accuracy and speed of video classification and video detection, focusing on several key technologies involved in video classification, temporal action detection, and spatial detection in video. The main work and novelties of this dissertation are summarized as follows:

(1) An information-fused temporal transformation network for video action recognition is proposed. Effective spatiotemporal feature representation is crucial to video-based action recognition. Focusing on discriminative spatiotemporal feature learning, we propose the Information Fused Temporal Transformation Network (IF-TTN) for action recognition, built on top of the popular Temporal Segment Network (TSN) framework. In the network, an Information Fusion Module (IFM) fuses the appearance and motion features at multiple ConvNet levels for each video snippet, forming a short-term video descriptor. With the fused features as inputs, Temporal Transformation Networks (TTN) model the middle-term temporal transformations between neighboring snippets in sequential order. As TSN itself depicts long-term temporal structure through segmental consensus, the proposed network comprehensively considers temporal features at multiple granularities. To our knowledge, this is the first work that combines short-term spatiotemporal feature fusion, sequential middle-term temporal modeling, and long-term temporal consensus. IF-TTN achieves state-of-the-art results on two of the most popular action recognition datasets, UCF101 and HMDB51. Empirical investigation reveals that the architecture is robust to the quality of the input motion maps: replacing optical flow with the motion vectors from the compressed video stream, performance remains comparable to the flow-based methods while testing is 10x faster.
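The abstract does not give implementation details, so the following PyTorch sketch only illustrates the two ideas named above: an IFM that fuses the appearance and motion feature maps of one snippet, and a TTN step that models the transformation between neighboring snippet descriptors. The concatenate-then-1x1-convolution fusion, the pairwise linear transformation, and all layer sizes are assumptions made for illustration, not the dissertation's exact design.

```python
# Minimal sketch of the IF-TTN building blocks (illustrative assumptions).
import torch
import torch.nn as nn

class InformationFusionModule(nn.Module):
    """Fuses appearance (RGB) and motion features of one snippet at one ConvNet level."""
    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution mixes the concatenated appearance/motion channels back down.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, motion_feat):  # each: (batch, channels, H, W)
        return torch.relu(self.fuse(torch.cat([rgb_feat, motion_feat], dim=1)))

class TemporalTransformation(nn.Module):
    """Models middle-term transformations between neighboring snippet descriptors."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, snippets):  # (batch, num_snippets, dim)
        # Pair each snippet descriptor with its successor, in sequential order.
        pairs = torch.cat([snippets[:, :-1], snippets[:, 1:]], dim=-1)
        return self.transform(pairs)  # (batch, num_snippets - 1, dim)
```

A segmental consensus over the resulting features, as in TSN, would then supply the long-term structure.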
(2) A temporal preservation network for precise temporal action localization is proposed. Temporal action localization is an important task in computer vision. Although a variety of methods have been proposed, how to predict the temporal boundaries of action segments precisely remains an open question. Most works use segment-level classifiers to select video segments pre-determined by action proposals or dense sliding windows. However, to achieve more precise action boundaries, a temporal localization system should make dense predictions at a fine granularity. A recently proposed work exploits Convolutional-De-Convolutional (CDC) filters to upsample the predictions of 3D ConvNets, making per-frame action prediction possible and achieving promising temporal action localization performance. However, the CDC network partially loses temporal information due to its temporal downsampling operations. In this dissertation, we propose an elegant and powerful Temporal Preservation Convolutional (TPC) network that equips 3D ConvNets with TPC filters. The TPC network fully preserves temporal resolution while simultaneously downsampling spatial resolution, enabling frame-level action localization with minimal loss of temporal information, and it can be trained in an end-to-end manner. Experimental results on public datasets show that the TPC network achieves significant improvement in both per-frame action prediction and segment-level temporal action localization.
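The defining property of a TPC filter, as described above, is that it downsamples space but not time. A minimal PyTorch sketch of that idea is a 3D convolution whose stride is 1 along the temporal axis and 2 along both spatial axes; the channel counts and clip size below are illustrative assumptions.

```python
# A TPC-style layer: 3D convolution with stride 1 in time and stride 2 in space,
# so temporal resolution is fully preserved while the spatial size is halved.
import torch
import torch.nn as nn

tpc = nn.Conv3d(in_channels=64, out_channels=128,
                kernel_size=(3, 3, 3),
                stride=(1, 2, 2),    # temporal stride 1: no temporal downsampling
                padding=(1, 1, 1))

clip = torch.randn(1, 64, 16, 56, 56)  # (batch, channels, frames, height, width)
out = tpc(clip)
print(out.shape)  # torch.Size([1, 128, 16, 28, 28]): all 16 frames survive
```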
(3) A frame segmentation network for temporal action localization is proposed. Although many methods have been proposed for temporal action localization, how to predict the temporal location of action segments precisely remains an open question. Most state-of-the-art works train action classifiers on video segments pre-determined by action proposals. However, recent work has found that a desirable model should move beyond the segment level and make dense predictions at a fine temporal granularity in order to determine precise temporal boundaries. In this dissertation, we propose a Frame Segmentation Network (FSN) that places a temporal CNN on top of 2D spatial CNNs. The spatial CNNs abstract semantics in the spatial dimension, while the temporal CNN introduces temporal context information and performs dense predictions. The proposed FSN can thus make dense frame-level predictions for a video clip using both spatial and temporal context. FSN is trained in an end-to-end manner, so the model is optimized jointly in the spatial and temporal domains. We also adapt FSN to the weakly supervised scenario (WFSN), where only video-level labels are provided during training. Experimental results on a public dataset show that FSN achieves superior performance in both frame-level action localization and temporal action localization.

(4) An end-to-end weakly supervised object detection network is proposed. It is challenging for a weakly supervised object detection network to precisely predict the positions of objects, since no instance-level category annotations are available. Most existing methods address this with a two-phase learning procedure: a multiple-instance-learning detector followed by a fully supervised detector with bounding-box regression. Based on our observation, this procedure may lead to local minima for some object categories. In this dissertation, we propose to jointly train the two phases in an end-to-end manner to tackle this problem. Specifically, we design a single network with both multiple-instance-learning and bounding-box regression branches that share the same backbone. Meanwhile, a guided attention module driven by the classification loss is added to the backbone to effectively extract the implicit location information in the features. Experimental results on public datasets show that our method achieves state-of-the-art performance.
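As a rough illustration of the joint single-network design in (4), the sketch below shares one backbone ROI feature between a multiple-instance-learning branch and a bounding-box regression branch. The WSDDN-style two-stream MIL scoring used here is an assumption for illustration, and the guided attention module is omitted; the dissertation's exact heads may differ.

```python
# Sketch of a single network with MIL and box-regression branches on a shared
# backbone; the two-stream MIL scoring is an illustrative assumption.
import torch
import torch.nn as nn

class WeaklySupervisedDetector(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.cls_stream = nn.Linear(feat_dim, num_classes)    # what each proposal shows
        self.det_stream = nn.Linear(feat_dim, num_classes)    # which proposals matter
        self.bbox_reg = nn.Linear(feat_dim, 4 * num_classes)  # jointly trained regressor

    def forward(self, roi_feats):  # (num_proposals, feat_dim) from the shared backbone
        cls = torch.softmax(self.cls_stream(roi_feats), dim=1)  # softmax over classes
        det = torch.softmax(self.det_stream(roi_feats), dim=0)  # softmax over proposals
        proposal_scores = cls * det                  # per-proposal, per-class scores
        image_scores = proposal_scores.sum(dim=0)    # image-level scores for the MIL loss
        box_deltas = self.bbox_reg(roi_feats)        # box refinements, trained end to end
        return image_scores, proposal_scores, box_deltas
```

Training would combine a classification loss on `image_scores` against the image-level labels with a regression loss on `box_deltas` supervised by high-scoring proposals, so both phases are optimized together rather than sequentially.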
Keywords/Search Tags:Video Classification, Video Detection, Convolutional Neural Network, Attention Mechanism, Computer Vision, Deep Learning