
Temporal Feature Fusion For Video Object Detection

Posted on: 2020-07-12    Degree: Master    Type: Thesis
Country: China    Candidate: Y Li    Full Text: PDF
GTID: 2428330602952394    Subject: Engineering
Abstract/Summary:
Deep learning research on static images has made great progress, while its application to video has only just begun. With video data becoming easy to acquire and computing power continuing to improve, video object detection, as a basic task of video understanding, is one of the pressing problems in computer vision. Compared with static images, video data is characterized by large volume, high redundancy, and temporal context, and it poses unique challenges such as occlusion, motion blur, video defocus, and unusual object poses. Existing methods generally exploit either the redundancy of video data, to speed up detection, or the temporal context, to improve detection accuracy.

This thesis mainly uses the temporal context of video data to improve detection quality on hard-to-distinguish frames through temporal feature fusion. At the same time, it improves the network structure of the video object detector to raise detection speed, so as to balance speed and accuracy. Guided by these ideas, the thesis makes the following three contributions; minimal code sketches of the key components (flow-guided warping, Bi-ConvGRU, spatial location attention, and non-local fusion) follow this abstract.

1. A video object detection method based on Bi-ConvGRU. Each frame in the video sequence is treated either as the current frame or as a reference frame. The current frame is passed through a feature extraction network to obtain current-frame features; reference-frame features are propagated along the optical flow to obtain flow-estimated features. A Bi-ConvGRU learns the relationship between the current-frame features and the estimated features, and an embedding network weights the Bi-ConvGRU outputs. The Bi-ConvGRU thus brings more reference-frame information into the current-frame features, improving their quality.

2. A temporal feature fusion method based on a spatial location attention mechanism. This method modifies the direction of the flow-guided feature propagation used in method 1, which removes redundant feature extraction but introduces a positional misalignment between the estimated features and the current-frame features. A spatial location attention mechanism replaces the embedding network of method 1; it aligns features spatially, reduces the number of network parameters, and improves detection speed with only a slight drop in accuracy.

3. A lightweight network based on non-local multi-scale temporal feature fusion. To remove the time overhead that the Bi-ConvGRU introduces in methods 1 and 2, that structure is abandoned and non-local modules fuse the temporal features instead. The back-end detection network is also replaced with a lighter structure, so the whole network can run on a machine with 4 GB of memory. To improve robustness to objects of different scales, shallow and deep features are additionally fused. Combining these components, the method achieves a better balance between detection accuracy and speed.
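As background for method 1, flow-guided propagation typically warps reference-frame features toward the current frame with bilinear sampling (as in flow-guided feature aggregation). Below is a minimal PyTorch sketch; the function name `warp_features` and the (B, 2, H, W) pixel-displacement flow layout are illustrative assumptions, not the thesis's actual interface.

```python
import torch
import torch.nn.functional as F

def warp_features(ref_feat, flow):
    """Bilinearly warp reference-frame features toward the current frame.

    ref_feat: (B, C, H, W) features of a reference frame
    flow:     (B, 2, H, W) flow from current to reference frame, in pixels
              (x-displacement first, then y)
    """
    B, C, H, W = ref_feat.shape
    # Base sampling grid: the (x, y) coordinate of every feature position.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=ref_feat.device, dtype=ref_feat.dtype),
        torch.arange(W, device=ref_feat.device, dtype=ref_feat.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0)           # (2, H, W)
    coords = base.unsqueeze(0) + flow             # displaced positions, (B, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(ref_feat, grid, mode="bilinear", align_corners=True)
```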
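The Bi-ConvGRU itself can be sketched as a convolutional GRU run over the frame features in both temporal directions, so each frame aggregates both past and future context. The 3x3 convolutional gates and the 1x1 fusion of the two passes below are assumptions; the thesis's exact gating may differ.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates are 3x3 convolutions over [input, hidden]."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        p = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=p)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, x, h):
        # Update gate z and reset gate r from the concatenated input and state.
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1.0 - z) * h + z * h_tilde

class BiConvGRU(nn.Module):
    """Runs one ConvGRU forward and one backward over a feature sequence,
    then fuses the two hidden states per frame with a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.fwd = ConvGRUCell(channels)
        self.bwd = ConvGRUCell(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feats):                     # feats: list of (B, C, H, W)
        h = torch.zeros_like(feats[0])
        fwd = []
        for f in feats:                           # past -> future pass
            h = self.fwd(f, h)
            fwd.append(h)
        h = torch.zeros_like(feats[0])
        bwd = []
        for f in reversed(feats):                 # future -> past pass
            h = self.bwd(f, h)
            bwd.append(h)
        bwd.reverse()
        return [self.fuse(torch.cat([a, b], 1)) for a, b in zip(fwd, bwd)]
```

In this sketch a detector would run `BiConvGRU` over the warped reference features plus the current-frame features, then feed the fused per-frame output to the detection head.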
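For method 2's spatial location attention, one plausible realization is a per-position softmax over the warped reference features, weighted by their similarity to the current-frame features in a shared embedding space. The 1x1 embedding and cosine similarity here are stand-ins for the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialLocationAttention(nn.Module):
    """Per-position attention: each (h, w) location weights the warped
    reference features by their similarity to the current-frame feature."""
    def __init__(self, channels, embed_dim=64):
        super().__init__()
        self.embed = nn.Conv2d(channels, embed_dim, 1)   # shared 1x1 embedding

    def forward(self, cur_feat, warped_feats):
        # cur_feat: (B, C, H, W); warped_feats: list of (B, C, H, W),
        # optionally including the current frame itself.
        q = F.normalize(self.embed(cur_feat), dim=1)
        logits = []
        for w in warped_feats:
            k = F.normalize(self.embed(w), dim=1)
            logits.append((q * k).sum(1, keepdim=True))  # cosine sim, (B, 1, H, W)
        weights = torch.softmax(torch.stack(logits, 0), 0)  # softmax over frames
        # Weighted sum, computed independently at every spatial position.
        return sum(a * w for a, w in zip(weights, warped_feats))
```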
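Method 3 fuses temporal features with non-local modules; in the standard non-local block, every position of the current-frame feature attends to every position of a reference feature, with a residual connection. A minimal single-head sketch, with the scaled dot-product form as an assumption:

```python
import torch
import torch.nn as nn

class NonLocalFusion(nn.Module):
    """Non-local block: current-frame positions attend to all reference positions."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, inter, 1)  # query projection
        self.phi = nn.Conv2d(channels, inter, 1)    # key projection
        self.g = nn.Conv2d(channels, inter, 1)      # value projection
        self.out = nn.Conv2d(inter, channels, 1)    # restore channel count

    def forward(self, cur_feat, ref_feat):
        B, C, H, W = cur_feat.shape
        q = self.theta(cur_feat).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.phi(ref_feat).flatten(2)                    # (B, C', HW)
        v = self.g(ref_feat).flatten(2).transpose(1, 2)      # (B, HW, C')
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return cur_feat + self.out(y)                        # residual connection
```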
Keywords: video object detection, feature fusion, Bi-ConvGRU, attention mechanism, non-local module