Crowd counting is an effective tool for analyzing the behavior of crowds in public places. Its purpose is to count crowds automatically through image processing techniques and to estimate crowd size and density for crowd monitoring or urban planning. Well-performing crowd counting algorithms can also be extended to vehicle counting, cell counting, and plant counting. These applications show that crowd counting has broad research value and application prospects. Because the dataset images are captured in dense scenes, crowd counting faces challenges such as significant variation in crowd scale, heterogeneous crowd distribution, and uneven crowd density. Crowd counting in open scenes raises the further problem of how to count accurately and effectively in complex conditions such as dark scenes. To address these challenges, this thesis proposes two algorithmic models.

(1) To address the problem of how to combine multi-source image data to improve counting accuracy in complex scenes such as dark scenes, we propose IMMNet-T, a Transformer-based interactive network that exploits the complementary nature of multimodal data. The feature extraction module of the network uses a ViT encoder, and the encoding embedding stage replaces the standard encoding with sliding convolutional encoding to obtain richer embedded features. The feature extraction stage adopts an intra-block crossover design that lets the multimodal features interact, so that more robust information is extracted. A Token Attention module then calibrates the extracted features by assigning them weights, and an MLP-based feature grouping alignment fusion module fuses the multimodal features. This achieves full fusion between the features of different modalities and yields finer features while preserving the original feature mapping, after which the network regresses the final output. Extensive experiments on the latest multimodal dataset, RGBT-CC, verify the complementarity of multimodal data and the validity of the method and of the model components. Comparative experiments are also conducted on the ShanghaiTech RGBD dataset. The results show that IMMNet-T performs well on multimodal crowd counting datasets, which reflects the feasibility and superiority of the algorithm.

(2) To address the large variation in crowd scale and density, we propose MLMNet-T, a Transformer-based encoder-decoder crowd counting network with multi-level monitoring (supervision). The encoder uses Twins to build a simple and efficient network that captures rich global information and produces multi-scale outputs. The decoder fuses high-level and low-level semantic features to obtain higher-resolution output features. To improve counting accuracy, training is supervised with a multi-level loss applied at the nodes where the encoder passes features to the decoder. Directly supervising these key nodes improves the quality of the encoder output features and thus the counting performance of the network. Finally, a multi-granularity feature aggregation regression head is designed for counting, using dilated convolutions with different dilation rates. Comparative experiments on several popular datasets give competitive results. Ablation experiments on two classical datasets, ShanghaiTech Part A and ShanghaiTech Part B, verify the necessity of each component in the network. The experimental results and visualizations demonstrate the effectiveness and advancement of MLMNet-T.
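
As an illustration of the multi-level supervision idea described for MLMNet-T, the following is a minimal sketch rather than the thesis implementation. It assumes that each supervised node predicts a density map at a coarser resolution, that the ground-truth density map is downsampled by block summation so the total count is preserved, and that the per-level losses are mean squared errors combined with hand-chosen weights. The function names, the downsampling factors, and the weights are all hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def sum_pool(density, factor):
    """Downsample a 2D density map by summing factor x factor blocks.

    Summing (rather than averaging) keeps the total count unchanged.
    """
    h, w = len(density), len(density[0])
    assert h % factor == 0 and w % factor == 0, "map size must divide evenly"
    return [
        [
            sum(density[i + di][j + dj]
                for di in range(factor) for dj in range(factor))
            for j in range(0, w, factor)
        ]
        for i in range(0, h, factor)
    ]


def mse(pred, target):
    """Mean squared error between two 2D maps of the same shape."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


def multi_level_loss(level_preds, gt_density, weights):
    """Weighted sum of per-level losses (assumed form, not from the thesis).

    level_preds: list of (predicted_map, downsample_factor) pairs,
                 one per supervised node.
    weights:     one loss weight per level (hypothetical values).
    """
    total = 0.0
    for (pred, factor), weight in zip(level_preds, weights):
        target = sum_pool(gt_density, factor)
        total += weight * mse(pred, target)
    return total


# Toy 4x4 ground-truth density map containing three people.
gt = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Hypothetical predictions at full, 1/2, and 1/4 resolution.
preds = [
    (gt, 1),                  # exact at full resolution
    ([[1, 0], [0, 2]], 2),    # exact at 1/2 resolution
    ([[2.0]], 4),             # undercounts by one at 1/4 resolution
]
loss = multi_level_loss(preds, gt, weights=[1.0, 0.5, 0.5])
print(loss)
```

In this toy case only the coarsest level is wrong, so the loss reduces to its weighted squared count error. A real implementation would compute the same quantities with tensor operations inside the training loop.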