| With the advancement of technology, unmanned aerial vehicles (UAVs) and artificial intelligence algorithms are gradually moving from professional fields into daily life. Combining UAVs with object detection algorithms can help users solve practical problems more effectively. Deep learning has become the mainstream research direction in object detection. However, deep learning models require substantial computing power and cannot be deployed on some edge devices. This thesis studies lightweight algorithms and small object detection for UAVs, taking into account the limited memory and processing power of onboard devices and the small size of targets observed from a bird’s-eye view. The main research contents of this thesis are as follows: (1) We choose the YOLOv5 algorithm, a single-stage object detection model with fast processing speed. Because small targets predominate in drone images, Chapter 3 of this thesis starts from the detection method and adds a set of anchor boxes on a shallow feature map so that the small-target information in that map can be detected. In the detection head, Double Head replaces the original head, allowing target box regression and sample classification to be trained separately. In the backbone network, lightweight Ghost Block modules replace the C3 structure, reducing redundancy in the feature maps and therefore the number of model parameters. The CBAM attention mechanism is introduced to focus the model on key areas, enabling it to make full use of context information when extracting features and improving detection accuracy. Geometric terms such as an aspect ratio penalty and the center point distance between two boxes are introduced into the loss function to improve regression accuracy. In the experimental section, the model is evaluated on the VisDrone2021 dataset of drone images. (2) An analysis of the advantages and disadvantages of Convolutional Neural Networks (CNNs) and Transformer structures shows that CNNs, which extract features by stacking convolutional kernels, perceive global information poorly. Transformer structures, on the other hand, use self-attention to obtain contextual information from images, but this requires a large amount of computation and is poorly suited to embedded devices. This thesis combines CNN and Transformer structures on the basis of YOLOv7-Tiny to extract global information from feature maps. Since Transformer structures require large datasets to fit, this thesis uses Mosaic data augmentation and MixUp data mixing to enlarge the dataset and improve the model’s generalization ability. The Transformer structure uses MobileViTv3 modules designed for mobile devices and incorporates the CA positional attention mechanism to improve model accuracy. To alleviate the partial loss of feature map information caused by nearest neighbor interpolation, this thesis adopts the CARAFE upsampling operator. |
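
The loss-function change in item (1) adds a center point distance term and an aspect ratio penalty to box regression. The abstract does not name the loss, so the sketch below assumes a CIoU-style formulation, which combines exactly these two terms with IoU; it may differ from the thesis's actual implementation. The function names `ciou` and `ciou_loss` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """CIoU-style score for two boxes given as (x1, y1, x2, y2).

    Assumption: this follows the commonly used CIoU formulation; the
    abstract only says that center distance and an aspect ratio penalty
    are added to the loss.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Widths and heights of both boxes.
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: intersection area over union area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 \
         + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4

    # Aspect ratio penalty and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


def ciou_loss(pred_box, target_box):
    """Regression loss: 0 for a perfect match, larger for worse boxes."""
    return 1.0 - ciou(pred_box, target_box)
```

With this sketch, two identical boxes give a loss near zero, two non-overlapping boxes give a loss above 1 because the center distance term still contributes when IoU is zero, and a box that shares the target's center but has a different aspect ratio is penalized beyond `1 - IoU`.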