
Spatial-temporal Context And Temporal Scheduler For Convolutional Neural Network Based Video Object Detection

Posted on: 2020-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Luo
GTID: 2428330599959585
Subject: Information and Communication Engineering
Abstract/Summary:
Video object detection, which comprises object localization and object classification, is a fundamental task in computer vision. It has many real-world applications, e.g., autonomous driving, video surveillance, and smart cities. Recent state-of-the-art feature aggregation paradigms for video object detection rely on inferring feature correspondence. Estimating feature correspondence is fundamentally difficult when image quality is poor, e.g., under motion blur, video defocus, and object occlusion, so the estimates are usually unstable. In addition, deploying a video object detection algorithm in real scenes places high demands on both speed and accuracy because computation capacity is limited, yet most state-of-the-art video object detection algorithms consider only recognition accuracy. To address these problems, we propose two solutions, targeting accuracy and speed respectively. The main contributions of this thesis are as follows:

1. Based on the spatial-temporal context in video, we propose a proposal-level feature aggregation framework that learns to enhance each proposal's features by modeling the dependencies among proposals within and across frames. By jointly considering visual features, spatial position, and temporal position, it makes full use of the spatial-temporal context. The proposed method has the following merits: it is fully trainable and needs no hand-crafted design such as a feature warping process; it circumvents the challenging problem of accurate feature correspondence estimation, which makes it robust to low-quality frames; and it captures the temporal consistency particular to video. We validate it on the ImageNet VID dataset, where it improves the single-frame Faster R-CNN baseline by about 6% and outperforms the previous state of the art by 1.4% mAP without temporal post-processing.

2. Based on convolutional neural networks, we propose a lightweight DorT (Detect or Track) framework for video object detection. The DorT framework formulates video object detection as a sequential decision problem and achieves a good trade-off between speed and accuracy by combining image object detection, single-object tracking, and an accurate learnable scheduler. It runs in real time (over 30 FPS) with low latency and is capable of object association, which suits scenarios such as autonomous driving. We validate the effectiveness of the proposed method on the large-scale video dataset ImageNet VID.
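
To make the idea of proposal-level aggregation in the first contribution more concrete, the following is a minimal sketch and not the thesis's implementation. It assumes each proposal carries a feature vector, a box center, and a frame index, and it scores every pair of proposals from visual similarity, spatial distance, and temporal distance. The thesis learns these dependencies end to end; here they are replaced by fixed, hypothetical coefficients `alpha` and `beta`, and the function name `aggregate_proposals` is likewise an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_proposals(proposals, alpha=1.0, beta=0.5):
    """Enhance each proposal's feature with context from all proposals.

    proposals: list of dicts with keys
        'feat'   - list of floats (visual feature)
        'center' - (x, y) box center (spatial position)
        'frame'  - int frame index (temporal position)
    alpha, beta: hypothetical fixed penalties; a trained model would
        learn the pairwise weighting instead of using these constants.
    """
    enhanced = []
    for p in proposals:
        d = len(p["feat"])
        scores = []
        for q in proposals:
            # Visual affinity: scaled dot product of the two features.
            visual = sum(a * b for a, b in zip(p["feat"], q["feat"])) / math.sqrt(d)
            # Spatial and temporal distances reduce the affinity.
            spatial = math.dist(p["center"], q["center"])
            temporal = abs(p["frame"] - q["frame"])
            scores.append(visual - alpha * spatial - beta * temporal)
        weights = softmax(scores)
        # Context is the attention-weighted sum of all proposal features.
        context = [
            sum(w * q["feat"][k] for w, q in zip(weights, proposals))
            for k in range(d)
        ]
        # Residual connection: original feature plus aggregated context.
        enhanced.append([f + c for f, c in zip(p["feat"], context)])
    return enhanced
```

Because the weights come from a softmax over all proposals, including those in other frames, no explicit feature correspondence between frames is computed, which is the property the abstract emphasizes.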
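
The detect-or-track decision in the second contribution can be sketched as a simple per-frame loop. This is an illustrative outline under stated assumptions, not the DorT implementation: `detect`, `track`, and `should_detect` are hypothetical callables standing in for the image detector, the single-object tracker, and the learnable scheduler, and their signatures are chosen here only for the example.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def detect_or_track(
    frames: Sequence[Any],
    detect: Callable[[Any], List[Box]],
    track: Callable[[Any, Any, List[Box]], List[Box]],
    should_detect: Callable[[Any, Any, List[Box]], bool],
) -> List[Tuple[str, List[Box]]]:
    """Process frames in order, choosing per frame to detect or to track.

    All three callables are placeholders (assumptions for this sketch):
      detect(frame)                     -> boxes from a full detector
      track(prev_frame, frame, boxes)   -> boxes propagated by a tracker
      should_detect(prev, frame, boxes) -> scheduler decision
    Returns one (decision, boxes) pair per frame.
    """
    results: List[Tuple[str, List[Box]]] = []
    boxes: List[Box] = []
    for i, frame in enumerate(frames):
        # The first frame has nothing to track, so it is always detected.
        if i == 0 or should_detect(frames[i - 1], frame, boxes):
            boxes = detect(frame)
            results.append(("detect", boxes))
        else:
            # Cheaper path: propagate the previous boxes with the tracker.
            boxes = track(frames[i - 1], frame, boxes)
            results.append(("track", boxes))
    return results
```

The speed gain in such a loop comes from running the expensive detector only on frames the scheduler selects, while the tracker carries each box, and with it the object's identity, through the remaining frames.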
Keywords/Search Tags:Object detection, Convolutional neural network, Feature aggregation, End-to-end