With the construction of smart cities and continued population growth, the number of safety incidents caused by large-scale crowd gatherings is increasing. It is therefore important to apply crowd counting methods to crowd monitoring and analysis in public places. The development of convolutional neural networks has significantly improved the accuracy of crowd counting models. In practical scenarios, however, two problems remain important challenges for the accuracy of crowd density estimation: complex background interference caused by the diversity of surveillance scenes, and scale variation caused by the different viewpoints of surveillance devices. To address these challenges, this paper proposes two crowd counting models. The specific research results are as follows.

First, to address continuous scale variation in complex scenes, this paper proposes a novel end-to-end Scale-adaptive Attention Network (SaANet) for crowd counting. By combining a scale-adaptive decoder with an attention mask generator, SaANet adaptively acquires multi-scale crowd features and suppresses background noise, generating high-quality density maps and thus achieving more accurate counting. To alleviate the blurring of predicted density maps caused by the resolution loss of down-sampling, the model introduces an attention gate network between the encoder and the decoder, which effectively fuses the features learned by the encoder with those of the scale-adaptive decoder and thereby improves the quality of the predicted density maps.

Second, to address perspective changes in crowd images and video surveillance scenes, this paper proposes a Spatial Feature Learning based Crowd Counting Network (SFLNet). Its spatial feature encoding module enhances the flow of information within the convolutional layers and captures spatial context information, which effectively reduces the counting errors caused by perspective changes. Under perspective change, people at different distances from the view plane appear at different scales in the 2D image. A multi-scale dense fusion module is therefore introduced; through dilated convolution and a densely connected multi-column structure, it captures crowd features at different scales and effectively improves the accuracy of density estimation.

The proposed models are validated on four publicly available benchmark datasets: ShanghaiTech, UCF_QNRF, UCF_CC_50 and WorldExpo’10. The experimental results show that SaANet and SFLNet achieve competitive performance, and ablation studies verify the effectiveness of each proposed module.
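
The role of dilated convolution in capturing several scales can be illustrated with a minimal sketch. This is not the paper's implementation: it works in 1D and plain Python for brevity, and the names `dilated_conv1d` and `receptive_field` are hypothetical helpers introduced only for this example. It shows that the same 3-tap kernel covers a wider span of the input as the dilation rate grows, which is the property a multi-column design with different dilation rates would rely on.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """1D dilated cross-correlation without padding (illustrative only)."""
    span = receptive_field(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` positions apart in the input.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 1, 1]

# One "column" per dilation rate; each column sees a different scale.
columns = {d: dilated_conv1d(signal, kernel, d) for d in (1, 2, 3)}
for d, values in columns.items():
    print(d, receptive_field(len(kernel), d), values)
```

With a fixed kernel size of 3, dilation rates 1, 2 and 3 give receptive fields of 3, 5 and 7 input positions without adding parameters, so columns with different rates respond to structures of different sizes.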