Computer vision is an important field of artificial intelligence that covers areas such as surveillance, face recognition, text recognition, and autonomous driving. Object detection is a fundamental problem in the field and the basis for other computer vision tasks. Improving model performance by increasing model complexity and dataset richness has now reached a bottleneck. An alternative approach is to introduce attention mechanisms, which enhance a model’s expressive power and improve its detection accuracy. This dissertation explores new attention mechanisms and structures and applies them to detection models.

Attention mechanisms are adaptive weights generated by a deep learning model that help it focus on particular features in the data. Attention computed along a given dimension strengthens the model’s representation in that respect, for example by weighting the importance of spatial positions or of individual feature maps. Exploring how to generate attention along different dimensions and applying it to object detection is therefore a worthwhile research direction. At present, attention mechanisms in detection models are applied almost exclusively to the backbone network used for feature extraction. Because the backbone is deep, modifying it is complex, and adding attention mechanisms to it greatly increases the model’s complexity. In addition, current attention mechanisms often focus on only one dimension and so cannot exploit information comprehensively, and most attention generation processes behave like black boxes with poor interpretability.

To address these issues, this dissertation improves the object detection model through two strategies. The first redesigns the model’s structure: feature maps from the backbone are fed into a cascaded multi-level attention structure that generates attention information along multiple dimensions, thereby fully exploiting information from different dimensions and improving the model’s final feature representation. The second takes into account the black-box nature of deep learning models and the large amount of feature data produced by convolutional kernels, and introduces statistical information into the model. Using classical statistical knowledge, it obtains global spatial attention, which improves the model’s localization accuracy, helps the model build better attention, and thereby enhances its performance.

The methods and structures designed in this dissertation have been validated on multiple datasets. First, the model was pre-trained on a large image dataset. It was then further trained on multiple datasets and evaluated on multiple test datasets. Ablation experiments on multiple datasets verified the effectiveness of the proposed methods.
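
To make the idea of attention along different dimensions concrete, the following is a minimal sketch and not the dissertation's actual method. It assumes a feature map stored as nested lists of shape channels x height x width. It derives a per-channel weight from each channel's mean, and a global spatial weight from simple statistics (mean and standard deviation) of the cross-channel average. The function names `channel_attention`, `spatial_attention`, and `apply_attention` are hypothetical, as is the choice of a sigmoid to map scores to weights.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function mapping any score to (0, 1).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_attention(fmap):
    # Assumption: one weight per channel, taken from the channel's mean activation.
    weights = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def spatial_attention(fmap):
    # Assumption: one weight per spatial position, taken from the z-score of the
    # cross-channel mean at that position relative to the whole map.
    n_channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    mean_map = [
        [sum(fmap[c][i][j] for c in range(n_channels)) / n_channels
         for j in range(width)]
        for i in range(height)
    ]
    flat = [v for row in mean_map for v in row]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var) + 1e-6  # small constant avoids division by zero
    return [[sigmoid((v - mu) / std) for v in row] for row in mean_map]


def apply_attention(fmap):
    # Rescale every activation by its channel weight and its spatial weight.
    cw = channel_attention(fmap)
    sw = spatial_attention(fmap)
    return [
        [[v * cw[c] * sw[i][j] for j, v in enumerate(row)]
         for i, row in enumerate(channel)]
        for c, channel in enumerate(fmap)
    ]


# Toy input: 2 channels of size 2 x 2, with a strong response at position (1, 1).
fmap = [[[0.0, 0.0], [0.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]]]
cw = channel_attention(fmap)
sw = spatial_attention(fmap)
out = apply_attention(fmap)
```

In this toy input the position with the strongest response receives the largest spatial weight, and the channel with the larger mean receives the larger channel weight. A learned attention module would replace these fixed statistics with trainable layers.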