| Image object detection is a fundamental and challenging task in computer vision. It aims to find the minimum bounding boxes that cover the objects of interest in an input image and, at the same time, to assign each box a semantic label. With the development of deep learning, object detectors have greatly improved in accuracy. Despite this remarkable progress, the vast majority of high-accuracy detectors involve hundreds or even thousands of convolutional layers and feature channels, so their model size and implementation efficiency are unacceptable for real-world applications that require online estimation and real-time prediction, such as self-driving and robot vision. To adapt to such scenarios, a large number of lightweight networks have been proposed for real-time object detection, typically using a single-path backbone with few convolutional layers. A single-path architecture, however, involves successive pooling and down-sampling operations, which often produce coarse and inaccurate feature maps that make objects hard to localize. In addition, because of their limited capacity, recent lightweight networks are often weak at modeling global relationships. Finally, existing lightweight models tend to use simple architectures in the neck and head of the detector for faster inference, which ignores the correlations among different features. To address these problems, this thesis makes the following contributions: (1) The fast down-sampling strategy in the shallow stages of lightweight detectors easily loses high-resolution details and hampers their extraction. To overcome this limitation, this thesis presents a dual-path network, named DPNet, for efficient object detection with lightweight self-attention. DPNet adopts a parallel-path architecture, leading to a dual-resolution backbone, where high-level semantic cues are encoded in the low-resolution path and low-level spatial details are 
extracted in the high-resolution path; both are important for object detection. In the backbone, a single-input, single-output lightweight self-attention module (LSAM) is designed to encode global interactions between different positions. Extensive experiments on the MS COCO dataset demonstrate that DPNet achieves a promising trade-off between detection accuracy and implementation efficiency. (2) To alleviate the problems caused by the limited capacity for modeling global correlations, this thesis designs En-DPNet, in which LSAM is upgraded to a lightweight self-correlation module (LSCM) on the basis of DPNet-S. More specifically, in the spatial attention, LSCM uses a larger pooling window to preserve spatial details and explore pixel-to-region relationships. In the channel attention, LSCM keeps relatively more feature channels and investigates channel-to-group-channel dependencies. Moreover, DPNet adopts a common feature pyramid network in its neck, which aggregates multi-scale features mainly via bilinear interpolation and element-wise addition. This simple fusion strategy ignores the dependencies across features of different sizes, which motivates extending LSCM into a multi-input version, LCCM, that aggregates cross-resolution features from different convolutional layers. Experimental results demonstrate that En-DPNet achieves 29.6% AP on MS COCO 2017 test-dev and 79.2% mAP on the Pascal VOC 2007 test set, with a model size of nearly 2.5M, 1.0 GFLOPs, and 164 FPS and 196 FPS, respectively, for 320 × 320 input images from the two datasets. (3) The two sub-tasks of object detection, regression and classification, lack information exchange and require different types of features. To alleviate this limitation, this thesis adds an interactive attention module (IAM) to En-DPNet; the resulting network is called Eh-DPNet. In the detection head, IAM models the features of the classification branch and the regression branch along the channel and spatial dimensions, generating the features required by each task and strengthening 
the information interaction between the two tasks. Experimental results demonstrate that Eh-DPNet achieves 30.4% AP on MS COCO 2017 test-dev, with a model size of nearly 1.06M, 2.75 GFLOPs, and 161 FPS for 320 × 320 input images. |
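
The feature-pyramid fusion mentioned in contribution (2), bilinear interpolation followed by element-wise addition, can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes single-channel feature maps stored as nested lists, corner-aligned bilinear sampling, and no lateral convolutions. The names `bilinear_upsample` and `fpn_fuse` are hypothetical and chosen only for illustration.

```python
def bilinear_upsample(feat, out_h, out_w):
    """Resize a 2-D feature map (list of rows) with bilinear interpolation.

    Assumption: corner-aligned sampling, i.e. the corner values of the
    input and the output coincide. Frameworks also offer other conventions.
    """
    in_h, in_w = len(feat), len(feat[0])
    # Scale factors that map output indices onto input coordinates.
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = int(y)                    # upper neighbouring row
        y1 = min(y0 + 1, in_h - 1)     # lower neighbouring row, clamped
        wy = y - y0                    # vertical interpolation weight
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = int(x)                # left neighbouring column
            x1 = min(x0 + 1, in_w - 1) # right neighbouring column, clamped
            wx = x - x0                # horizontal interpolation weight
            top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
            bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def fpn_fuse(high_res, low_res):
    """Upsample low_res to the size of high_res, then add element-wise."""
    h, w = len(high_res), len(high_res[0])
    up = bilinear_upsample(low_res, h, w)
    return [[high_res[i][j] + up[i][j] for j in range(w)] for i in range(h)]


# Toy example: fuse a 2 x 2 low-resolution map into a 3 x 3 high-resolution map.
low = [[0.0, 2.0], [4.0, 6.0]]
high = [[1.0] * 3 for _ in range(3)]
fused = fpn_fuse(high, low)
```

Each fused value depends only on the co-located high-resolution value and on interpolated low-resolution neighbours, using fixed weights that do not depend on feature content. The abstract's criticism that this fusion ignores dependencies across features of different sizes refers to this kind of scheme.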