| In recent years, automatic crowd counting has become an active research direction in computer vision. The biggest challenge for current crowd counting algorithms is the high crowd density and small target size in such scenes, which makes detection in the dense regions of an image very difficult for some algorithms. To address this problem, this paper designs two plug-and-play feature fusion modules that fuse image features from different layers so that the counting model can adapt to image regions of different scales. The research contents and innovative achievements of this paper are as follows: (1) To address the large scale variation in crowd scene images, this paper designs two multi-scale feature fusion modules: the attention-weighted fusion module (AWF) and the bottom-up fusion module (BUF). The attention-weighted fusion module uses an attention branch to learn weights for features at three different scales and outputs the weighted multi-scale features. The bottom-up fusion module gradually superimposes low-level and deep-level features and outputs a higher-level image feature. Both fusion modules process multiple input features into one output feature. (2) To verify the effectiveness of the multi-scale fusion modules, this paper builds two crowd counting models with the fusion module as the core: ResNet50+AWF and ResNet50+BUF. Both models consist of three parts: the backbone network, the fusion module, and the decoding module. They differ only in the fusion module. The backbone network is ResNet50, which processes the input crowd images and extracts image features at three different scales. The decoding module consists of two convolution layers that decode the output features of the fusion module and predict the crowd density map. (3) To measure the effect of the proposed
counting models, this paper conducts experiments on public datasets. The results show that on the ShanghaiTech dataset the ResNet50+AWF model reaches a mean absolute error (MAE) of 44.2 (Part A) and 7.4 (Part B) and a mean square error (MSE) of 100.4 (Part A) and 10.6 (Part B), while the ResNet50+BUF model reaches an MAE of 52 (Part A) and 7.8 (Part B) and an MSE of 104.2 (Part A) and 11.2 (Part B). On the more challenging UCF_CC_50 dataset, the ResNet50+AWF model reaches an MAE of 224.3 and an MSE of 322.6, and the ResNet50+BUF model reaches an MAE of 239.5 and an MSE of 324. (4) This paper also constructs a scenic-area crowd dataset and evaluates the two proposed counting models on it; the MAE is reduced to 1.74 and 2.32, achieving high-precision crowd estimation on this dataset. In addition, to further investigate the factors that influence the crowd counting models during the experiments, this paper conducts comparison experiments to verify how factors such as the dataset processing method and the training parameters affect model performance. Finally, this paper uses a peak search algorithm to locate crowd targets on the density map and improves the density map based on the distance distribution, which achieves higher-precision target localization. |
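
As a reading aid for the reported metrics, the following is a minimal sketch of how MAE and MSE are commonly computed in crowd counting. It assumes that the predicted count of an image is the sum of its predicted density map, and that "MSE" follows the usual crowd counting convention of taking the square root of the mean squared count error. Neither assumption is confirmed by the text above, and the function names and toy data are hypothetical.

```python
import math
from typing import Sequence


def count_from_density(density_map: Sequence[Sequence[float]]) -> float:
    """Estimated count of one image: the sum of all density-map values (assumed)."""
    return sum(sum(row) for row in density_map)


def mae(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n


def mse(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Square root of the mean squared count error (assumed convention)."""
    n = len(true_counts)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n)


# Toy data (hypothetical): two tiny 2x2 "density maps" and their true counts.
pred_maps = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.25, 0.25], [0.25, 0.25]],
]
true_counts = [3.0, 1.0]

pred_counts = [count_from_density(m) for m in pred_maps]
mae_value = mae(pred_counts, true_counts)
mse_value = mse(pred_counts, true_counts)
```

Under these assumptions, both metrics are computed on per-image counts rather than on per-pixel density values, so a model can score well even if its density map is spatially imprecise. This may be one reason the localization step is evaluated separately.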