Font Size: a A A

Deep Learning Based Dynamic Object Detection In Videos

Posted on:2020-01-05Degree:MasterType:Thesis
Country:ChinaCandidate:X B LiFull Text:PDF
GTID:2428330620459986Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Human beings acquire more than 80%sensory information through vision from their environment.Thus,computer vision are becoming a more and more important topic in artificial intelligence,both theoretically and practically.Currently,visual information is represented in a discrete spatial and temporal color space.It is still a challengeing task how to implement a machine with vision such that it is able to extract semantic information like human.There exist quite a large number of investigations of static image,and the tasks include classification,segmentation,object detetion,object tracking,etc.This paper concentrates on object detection in videos.This task aims at finding specific objects and their spatial locations in videos,which is a fundamental issue for many vision recognition problems,like object tracking,and is widely used in many applications like security,auto-driving,etc.Since 2012,convolutional neural networks have been applied to several tasks in computer vision,especially those on static images.There have been a number of successful models for static image object detection.Two-stage methods firstly generates region proposals where objects may locate,and then filters out the proposals that are apparently located in the background.Then the remaining proposals are fed into a convolutional network for classification and boundary fine-tuning.While in one-stage methods,the proposals are fed into convolutional models without a filtering process,thus one-stage methods are end-to-end methods.The filtering process in two-stage methods is able to filter out the regions located in the background,thus two-stage methods can usually reach a high accuracy,but may suffer from the time complexity.Wheras,one-stage methods are usually fast but may result in a relatively lower accuracy.A video is actually a collection of static images in the time axis,so a very simple idea in video object detection is just apply methods for image object detection on each frame in videos.However,such a simple extension does not take the temporal information into consideration,so it may suffer from low accuracy due to occlusion,blur and illumination condition change.This paper attempts to leverage the temporal relationship among the frames to improve the detection accuracy but keep sufficient time effeciency.Thus this paper adopts a one-stage method,adding the model with structures that deal with temporal information.The main contribution of this paper is the algorithm that utilizes a popular pre-trained network as a backbone to extract some key features,and then feed the output of several levels into the feature pyramid network to extract pyramid features as well as generate the region proposals,utilizing the relationship between different levels to model the objects in different scales.The features are further fed into two subnets,classification subnet and regression subnet,for fine-tuning the boxes.Temporal features are leveraged in the two subnets.In the classification subnet,focal loss is used to reduce the influence of the big imbalance between foreground and background.A non-maximum-suppression is deployed to the rsult of the two subnets and the final results are generated.Experiments on the auto-driving dataset by Udacity has shown that this method is effective.It is proved that if the recurrent unit is deployed to only one of the two subnets,the improvement of the detection is little.However,if the recurrent units are deployed to both of the subnets,the detection performance is improved significantly.Besides,by the comparison between focal loss function and the traditional cross entropy loss function,it is shown that the focal loss function helps a lot in releasing the drawback due to the big imbalance between foreground and background.And at last,experiments about the computational performance are carried out and the results prove that the new structures do not add to much time complexity.
Keywords/Search Tags:Computer Vision, Deep Learning, Convolutional Neural Networks, Recurrent Neural Networks, Focal Loss, Object Detection
PDF Full Text Request
Related items