With the rapid development of artificial intelligence in recent years, traditional research directions in computer vision have been supplemented and improved by deep learning methods. Object detection is one of the core topics in computer vision and the basis of many other vision tasks. Compared with traditional object detection methods, methods based on deep learning achieve higher accuracy and stronger robustness, and they have been widely adopted thanks to their detection speed. Among them, the SSD (Single Shot MultiBox Detector) combines high detection accuracy with high speed, but it still has problems such as low detection accuracy for small objects. Therefore, this paper proposes an improved SSD network model based on clustering and feature fusion, which improves detection accuracy while maintaining a fast detection speed. The main research contents of this paper are as follows: (1) The scales at which prior boxes are generated in the original SSD model are fixed, manually set values. This design gives the model better generalization, but it is not well suited to the identification and localization of specific objects. Since the width and height of a given kind of target usually stay within a stable range, this paper uses the K-Means method to cluster the aspect ratios of each category of targets and sets the aspect ratios of the prior boxes according to the clustering results. This speeds up fine-tuning of the network and improves the localization accuracy of the model. (2) To address the insufficient detection accuracy of the SSD model for small objects, this paper proposes a feature fusion method based on the feature pyramid structure. Shallow features carry insufficient semantic information, whereas deep features carry enough semantic information but lose target location information over successive convolutions. Building on the multi-scale features of the feature pyramid structure, this paper uses a deconvolution network to extract the semantic information of the deep layers, uses a dilated convolution network to extract the location information of the shallow layers, and applies convolution to the middle-layer features to reduce the number of channels. The resulting new feature layer is used for object classification and localization. Experimental results on the PASCAL VOC image data set show that the proposed model significantly improves the detection accuracy of small targets.
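
The clustering step in (1) can be illustrated with a short sketch. This is not the paper's implementation, only a minimal example under stated assumptions: the function names `kmeans_1d` and `prior_aspect_ratios` are hypothetical, the number of clusters (k = 3), the quantile-based initialization, and the sample box sizes are all chosen for illustration, and the aspect ratio is taken to be width divided by height.

```python
def kmeans_1d(values, k, max_iters=100):
    """Plain Lloyd-style K-Means on scalar values; returns sorted cluster centers.

    Assumption: centers are initialized at evenly spaced quantiles of the data,
    which keeps the sketch deterministic (the paper does not state its choice).
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]

    for _ in range(max_iters):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)

        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers

    return sorted(centers)


def prior_aspect_ratios(boxes, k=3):
    """Cluster width/height ratios of one category's ground-truth boxes.

    boxes: iterable of (width, height) pairs; the k returned centers would
    serve as the aspect ratios of the prior boxes for that category.
    """
    ratios = [w / h for w, h in boxes if h > 0]
    return kmeans_1d(ratios, k)


if __name__ == "__main__":
    # Hypothetical (width, height) sizes for a single object category.
    sample_boxes = [
        (10, 20), (11, 20), (9, 20),
        (20, 20), (21, 20), (19, 20),
        (40, 20), (42, 20), (38, 20),
    ]
    print(prior_aspect_ratios(sample_boxes, k=3))
```

For the sample boxes above, the three centers come out at roughly 0.5, 1.0 and 2.0. In practice the boxes would be read from the training annotations of each category, and the choice of k trades off the number of prior boxes per location against how closely they match the data.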