With the rapid development of science and technology, video surveillance has been widely deployed at low cost in important public places such as traffic arteries, stations, airports, parks, and schools. Accurately estimating the number of objects in videos or images is a significant and challenging task. Among object counting tasks, crowd counting has a wide range of practical applications and plays an important role in many fields, including defense early warning, urban planning, intelligent commerce, and traffic scheduling. The main task of crowd counting is to accurately estimate the number of people and the spatial distribution of crowd density. However, crowd scenes present problems such as occlusion, illumination changes, cluttered backgrounds, and scale variation, which make crowd counting more difficult. To meet the needs of crowd counting in complex crowd scenes, this dissertation starts from representation learning of crowd features, constructs counting network models, designs feature extraction algorithms, and studies image crowd counting models based on convolutional neural networks in complex scenes. The specific research contents are as follows.

First, to address the high density and nonlinear distribution found in complex crowd scenes, this dissertation proposes a crowd counting model using cross-adversarial loss and attention context information. Based on a design with two generative adversarial networks, multi-scale features and boundary information are obtained by applying dilated convolution to the U-Net structure of the generator, and an attention mechanism is used to focus on the global spatial context and on finer texture details. Regression training is then optimized by integrating multiple loss functions, dominated by the cross-adversarial loss. After that, the residual information generated by cross-adversarial joint training and the context information are integrated into feature extraction and density map estimation to mitigate the negative impact of complex backgrounds and nonlinear crowd distribution on counting accuracy.

Second, to address scale variation in complex crowd scenes, this dissertation proposes a crowd counting model based on the fusion of scale-aware information and dual-attention-aware information. A shared network is designed to adaptively encode multi-scale context information, and a perspective transform applied to the convolution is connected to further smooth the transition between scales. Internal semantic information is encoded by a dual attention mechanism along the position and channel dimensions simultaneously, which strengthens the expression of feature information. Scale awareness and attention awareness are then combined to complete feature extraction and density map estimation, reducing the interference of pronounced scale variation on accurate crowd counting.

Third, to address the problems of scarce image data, difficult labeling, and adaptive cross-scene application, this dissertation proposes a scene-adaptive crowd counting model guided by affine parameters. Borrowing the idea of few-shot unsupervised image-to-image translation, a supervisory network generates the affine parameters of batch normalization in the counting network from a small number of unlabeled images, so that the network can adapt to different target scenes. For crowd feature acquisition, more detailed features are obtained by fusing multiple convolution kernels, and shallow and deep features are fused with the help of channel attention. Feature representation and density map estimation are realized by combining the requirements of the specific crowd scene with the adaptive supervisory parameter network, which reduces the model's demand for labeled data and supports cross-scene application.

Finally, to address counting in unconstrained crowd scenes, this dissertation proposes a crowd counting model based on cross-modal collaborative representation learning with region recognition. Through a dual information propagation mechanism, the modality-shared and modality-specific representations are dynamically enhanced to make full use of the complementarity of optical and thermal imaging information. At the same time, the region recognition design for visible images and the regional characteristics of thermal imaging information are used to achieve background perception. Combined with a density contribution probability model built from point annotations, feature extraction and density map estimation are completed together with multi-modal representation learning that integrates background perception, so as to improve counting accuracy in unconstrained crowd scenes under different lighting conditions.
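
The density map estimation that all four models rely on can be illustrated with a minimal sketch. It assumes the common practice of turning point annotations (one point per head) into a density map by placing a normalized Gaussian at each point, so that the sum of the map equals the crowd count. The function names, the fixed `sigma`, and the border renormalization are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a 2D Gaussian kernel of odd side length `size` that sums to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(points, height: int, width: int, sigma: float = 4.0) -> np.ndarray:
    """Build a density map from (x, y) point annotations.

    Each in-bounds point contributes a Gaussian blob with total mass 1,
    so density.sum() equals the number of in-bounds points.
    `sigma` is a hypothetical fixed kernel width chosen for illustration.
    """
    density = np.zeros((height, width), dtype=np.float64)
    radius = int(3 * sigma)
    kernel = gaussian_kernel(2 * radius + 1, sigma)
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < height and 0 <= col < width):
            continue  # ignore annotations that fall outside the image
        # Clip the kernel window to the image borders.
        top, bottom = max(row - radius, 0), min(row + radius + 1, height)
        left, right = max(col - radius, 0), min(col + radius + 1, width)
        patch = kernel[top - row + radius: bottom - row + radius,
                       left - col + radius: right - col + radius]
        # Renormalize the clipped patch so each person still counts as 1.
        density[top:bottom, left:right] += patch / patch.sum()
    return density


if __name__ == "__main__":
    heads = [(10.0, 12.0), (0.0, 0.0), (63.0, 47.0)]
    dmap = density_map(heads, height=48, width=64)
    print("estimated count:", dmap.sum())
```

A counting network trained to regress such a map yields the crowd count by summing its output and the spatial distribution of crowd density from the map itself, which matches the two goals of crowd counting stated in the abstract.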