Video Object Detection Based On Adaptive Convolution Network And Visual Attention Mechanism

Posted on: 2021-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: L F Fan
Full Text: PDF
GTID: 2518306050970909
Subject: Master of Engineering
Abstract/Summary:
With the increasing popularity of intelligent surveillance systems and handheld shooting devices in daily life, video data has grown exponentially, and video object detection has become one of the vital problems in computer vision. In recent years, deep learning has greatly advanced image object detection, and a wide variety of models and algorithms have emerged, laying the foundation for video object detection. Compared with images, however, video data is characterized by high similarity between frames, temporal correlation between adjacent frames, and a large volume of data. It also poses unique challenges such as scale changes, appearance changes, object occlusion, and motion blur caused by the relative motion of objects and camera. Existing mainstream algorithms use the correlation between video frames either to fuse features, which improves accuracy, or to propagate features, which improves speed. To address these issues, this thesis carries out further research on feature extraction, feature fusion, the loss function, and key frame selection in order to balance the accuracy and speed of video object detection, and evaluates the proposed methods experimentally. The main contents of this thesis are as follows:

1. To address the large number of small targets in video, an object detection network based on multi-scale feature cross-fusion and IoU (Intersection over Union) loss is proposed and implemented. The multi-scale feature cross-fusion module fuses the detailed information of shallow features with the semantic information of deep features across scales, which improves the network's recall for dense small targets. An IoU loss that depends on box area regresses the bounding box coordinates as a whole, so that the network no longer treats the same positional deviation equally even though it affects large and small targets differently. A center-distance factor and an area factor accelerate convergence both when the predicted bounding box completely overlaps the ground truth and when it does not. Experiments show that this method effectively improves the accuracy of object detection, especially the recognition rate for small targets.

2. To address the appearance changes, scale changes, rotations, and similar variations that are common in video, an object detection network based on multi-scale adaptive convolution and channel attention is proposed and implemented. The multi-scale adaptive convolution module uses standard convolution to extract image features, deformable convolution to learn changes in object position, and dilated convolution to enlarge the receptive field while retaining more detailed information, so that the model can learn the geometric transformations of objects at different scales. Channel attention is then used to adaptively weight and fuse the outputs of the module's three branches. Experiments show that this method greatly improves the accuracy of the model in detecting non-rigid, deformable objects.

3. To address motion blur and object occlusion in video, an adaptive key frame selection strategy is proposed. Key frames are determined dynamically according to the similarity of SIFT (Scale-Invariant Feature Transform) features between video frames. When the similarity is below a given threshold, frames are sampled densely and the features of the key frames are extracted by the backbone convolutional network. When the similarity is above the threshold, frames are sampled sparsely: the feature maps of the preceding key frame are warped according to the spatial correspondence between frames given by optical flow and then propagated to the non-key frames. High-confidence bounding box sequences are then constructed along the time axis from consecutive frames; boxes in a sequence are re-scored, and other boxes in the same frame that lie close to the sequence are suppressed. Compared with state-of-the-art algorithms such as Deep Feature Flow and Impression Network, which select key frames at a fixed interval, the proposed method achieves a good balance between detection speed and detection accuracy and reduces the influence of motion blur and object occlusion on the detection results.
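
The threshold rule behind adaptive key frame selection can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `select_key_frames` and `toy_similarity` are hypothetical, the similarity is a toy stand-in for the SIFT-based measure named in the abstract, and comparing each frame against the most recent key frame (rather than its immediate predecessor) is an assumption, since the abstract does not say which pair of frames is compared.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_key_frames(
    frames: Sequence[Frame],
    similarity: Callable[[Frame, Frame], float],
    threshold: float,
) -> List[int]:
    """Return the indices of frames chosen as key frames.

    Assumption: each frame is compared with the most recent key frame.
    A frame becomes a new key frame when its similarity to that key
    frame drops below `threshold`; otherwise it is a non-key frame whose
    features would be propagated from the key frame instead of recomputed.
    """
    if not frames:
        return []
    key_indices = [0]  # the first frame always serves as a key frame
    last_key = frames[0]
    for i in range(1, len(frames)):
        if similarity(last_key, frames[i]) < threshold:
            key_indices.append(i)
            last_key = frames[i]
    return key_indices


def toy_similarity(a: float, b: float) -> float:
    """Hypothetical stand-in for a SIFT-based similarity in [0, 1]."""
    return 1.0 - abs(a - b)


# Each "frame" is a single number here purely for illustration.
frames = [0.0, 0.05, 0.1, 0.6, 0.62, 0.65, 0.1]
keys = select_key_frames(frames, toy_similarity, threshold=0.8)
print(keys)  # [0, 3, 6]
```

In this toy run the content changes sharply at frames 3 and 6, so those frames are re-sampled as key frames, while the slowly varying frames in between are left as non-key frames. A fixed-interval scheme would instead place key frames at regular positions regardless of how much the content changed.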
Keywords/Search Tags: Video object detection, Multi-scale feature fusion, Adaptive convolution, Attention mechanism, Adaptive key-frame selection