Moving object detection in traffic scenes is of great importance in computer vision, and it is a key component of autonomous driving and driver-assistance systems. However, because of the properties of moving objects and the complexity of the background, individual frames suffer from problems such as motion blur, defocus blur, and object occlusion, which pose great challenges to the moving object detection task. As a result, object detectors designed for static images perform poorly when applied directly to traffic-scene datasets. It is observed, however, that within a group of consecutive video frames there are always several frames with high-quality features, on which an object detector can perform well. This paper therefore improves a static-image object detector by extracting the motion information of objects and fusing the features of adjacent frames, so as to improve the feature quality of the current frame. The main work is as follows:

(1) We design and implement a feature alignment network based on deformable convolution. Because the spatial positions and poses of objects differ across frames, fusing features directly leads to feature misalignment and superposition across time steps, which harms detection; feature alignment must therefore be performed before feature fusion. This paper explores the performance of the Farneback optical flow method and of deformable convolution for feature alignment, and finally adopts deformable convolution. Deformable convolution can learn the pixel-level correspondence between the object features of two frames and exploit its strong spatial transformation capability to warp features. Given the feature maps of the current frame and its adjacent frames, the deformable-convolution-based network models the object motion and maps the features of the adjacent frames to the current time step.

(2) We implement a spatiotemporal feature fusion module based on the non-local network to aggregate the features of adjacent frames. During fusion, cosine similarity is used as the distance measure: the more similar a feature is to the current frame, the higher its cosine similarity and the larger its fusion weight. The residual structure of the non-local network is also used to reduce training difficulty. By weighting features according to their contribution to the task, this fusion method effectively aggregates features from different temporal and spatial positions. Concretely, the cosine distances of the aligned feature maps are computed first, the weights are normalized by a weighting network, and the features of the adjacent frames are then weighted and summed to obtain the aggregated features.

(3) We use CenterNet as the detection sub-network for moving objects. CenterNet is a single-stage object detector: it is faster than Faster R-CNN and, because it does not rely on anchors as prior candidate boxes, it generalizes better across different datasets. To improve the speed of the model while maintaining detection accuracy, this paper identifies problem frames by image similarity, image sharpness, and object motion scale, and performs feature fusion only on those frames.
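As a concrete illustration of the alignment and fusion steps described above, the following PyTorch sketch aligns adjacent-frame features with a deformable convolution whose offsets are predicted from the concatenated feature pair, then fuses the aligned features with cosine-similarity weights and a residual connection. The module and variable names (AlignFuse, offset_net) and the softmax weight normalization are illustrative assumptions, not the thesis's actual implementation.

```python
# A minimal sketch of deformable-convolution alignment followed by
# cosine-similarity-weighted fusion. Names are hypothetical, not the
# thesis's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class AlignFuse(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Predict per-pixel sampling offsets from the concatenated
        # (current-frame, adjacent-frame) feature pair.
        self.offset_net = nn.Conv2d(2 * channels,
                                    2 * kernel_size * kernel_size,
                                    kernel_size, padding=pad)
        # Deformable convolution warps the adjacent-frame features
        # onto the current frame's spatial layout.
        self.align = DeformConv2d(channels, channels,
                                  kernel_size, padding=pad)

    def forward(self, ref_feat, support_feats):
        """ref_feat: (N, C, H, W) current frame; support_feats:
        list of (N, C, H, W) adjacent-frame feature maps."""
        aligned = [ref_feat]
        for sup in support_feats:
            offsets = self.offset_net(torch.cat([ref_feat, sup], dim=1))
            aligned.append(self.align(sup, offsets))
        # Cosine similarity to the current frame -> per-pixel weights:
        # more similar features receive larger fusion weights.
        sims = torch.stack(
            [F.cosine_similarity(ref_feat, a, dim=1) for a in aligned],
            dim=0)                                     # (T, N, H, W)
        weights = F.softmax(sims, dim=0).unsqueeze(2)  # (T, N, 1, H, W)
        fused = (torch.stack(aligned, dim=0) * weights).sum(dim=0)
        # Residual connection, echoing the non-local block's design,
        # keeps the fusion module easy to train.
        return ref_feat + fused
```

Here the softmax over the frame dimension plays the role of the weight-normalization network: aligned features that are more similar to the current frame contribute more to the aggregated result.

The problem-frame screening can likewise be sketched with simple heuristics. The abstract does not specify the similarity, sharpness, or motion-scale measures used, so the variance-of-Laplacian sharpness test and normalized cross-correlation below, together with their thresholds, are stand-in assumptions.

```python
# A hypothetical problem-frame screen, assuming OpenCV. Measures and
# thresholds are illustrative stand-ins, not the thesis's criteria.
import cv2
import numpy as np

def is_problem_frame(prev_gray, curr_gray,
                     blur_thresh=100.0, sim_thresh=0.9):
    """Flag a frame for feature fusion when it looks blurry or differs
    strongly from its predecessor (suggesting large apparent motion)."""
    # Variance of the Laplacian: a common sharpness proxy; a low value
    # suggests motion blur or defocus blur.
    sharpness = cv2.Laplacian(curr_gray, cv2.CV_64F).var()
    # Normalized cross-correlation as a cheap inter-frame similarity.
    a = (prev_gray - prev_gray.mean()).ravel()
    b = (curr_gray - curr_gray.mean()).ravel()
    similarity = float(a @ b /
                       (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sharpness < blur_thresh or similarity < sim_thresh
```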
To verify the effectiveness of the improved moving object detector, experiments are designed on the UA-DETRAC and KITTI traffic-scene datasets, and the performance of the improved model is analyzed with respect to the object categories of the datasets and the detection difficulty. The improved model is also compared with commonly used object detection models to verify the effectiveness of the proposed algorithm.