
Research On Pedestrian Detection Algorithm Based On Vision Transformer

Posted on: 2024-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Lv
GTID: 2568307058476124
Subject: Circuits and Systems
Abstract/Summary:
Pedestrian detection is a classic problem in computer vision, and its performance underpins many human-centered visual technologies. Designing pedestrian detection algorithms with higher accuracy and stronger robustness therefore remains a research topic of considerable practical importance. With the development of deep learning, researchers have proposed many pedestrian detection algorithms based on convolutional neural networks (CNNs), which have significantly improved detection accuracy and speed. In recent years, the Transformer has spread from natural language processing to computer vision, where its global attention mechanism has delivered strong performance in image recognition, object detection, and other visual tasks. Applying the Vision Transformer to pedestrian detection has consequently become an active research direction. To address the shortcomings of the general object detection algorithm DETR in pedestrian detection, and with autonomous driving applications in mind, this thesis improves on DETR and proposes CF-DETR, a fully Transformer-based pedestrian detection algorithm. The main research work is summarized in five aspects.

(1) Network architecture design. The algorithm follows the end-to-end design of the DETR detector and reduces pedestrian detection to a set prediction problem. A set-based global loss enforces unique predictions through bipartite matching, and, combined with the Transformer encoder-decoder architecture, this simplifies the detection pipeline.

(2) Encoder design. In DETR, because of computational complexity constraints, the encoder operates on image features extracted by a convolutional neural network. In CF-DETR, the Transformer encoder processes the image directly, and the window attention of Swin Transformer is used to keep the computational complexity manageable.

(3) Decoder design. CF-DETR decouples the object query in the decoder into two parts, a content query and a spatial query, which speeds up training convergence. In addition, to give the model better positional priors, the cross-attention module of the decoder is modified to further improve detector performance.

(4) Algorithm validation. To assess the effectiveness of the improvements in CF-DETR, comparative and ablation experiments were conducted on multiple datasets. The experimental results verify the feasibility and effectiveness of the improvements, and visual analysis illustrates the detection results.

(5) Research on robustness and generalization. The robustness and generalization of the algorithms are evaluated through cross-dataset training and testing. Building on these experiments, progressive training on multiple datasets is used to further improve detector performance.

The main achievements and innovations of this work are as follows:

(1) A more holistic network structure. The original DETR uses a convolutional neural network for feature extraction and then processes the features with a Transformer. It uses only the high-level CNN features, which can easily lose fine detail. CF-DETR processes the original image directly with a Transformer, so the network benefits from global attention and captures both global and detailed features more effectively, which further improves performance. Abandoning the convolutional neural network and using a Swin Transformer encoder to process images directly also simplifies the network structure and makes the architecture more unified.

(2) A network design that converges more easily. Because both the classification and localization tasks in DETR depend on the object queries, the quality of those queries matters greatly, and the network needs more training epochs to converge. CF-DETR decouples the object query into a content query and a spatial query, responsible for classification and localization respectively, which significantly accelerates convergence. In addition, to give the detection network better positional priors, scale (proportional) information is injected into the attention map, further accelerating convergence.

(3) Improved robustness and generalization. Based on an analysis of cross-dataset training and testing results for existing algorithms, a larger and more diverse pedestrian detection dataset is used for pre-training, followed by a progressive fine-tuning strategy, which yields stronger network performance and higher detection accuracy.
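
The one-to-one matching step that forces unique predictions in DETR-style set prediction (standard DETR calls it bipartite matching) can be illustrated with a small sketch. This is not the thesis's implementation: the function name `match_predictions`, the brute-force search, and the cost values are all assumptions chosen for readability. A real detector would typically use the Hungarian algorithm and a cost that combines class and box terms.

```python
from itertools import permutations


def match_predictions(cost):
    """Find the minimum-cost one-to-one assignment of predictions to ground truth.

    cost[i][j] is the (hypothetical) cost of assigning prediction j to
    ground-truth box i. Brute force over all assignments, so this is only
    suitable for a toy example.
    """
    num_gt = len(cost)
    num_pred = len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Each candidate picks a distinct prediction for every ground-truth box.
    for assignment in permutations(range(num_pred), num_gt):
        total = sum(cost[i][j] for i, j in enumerate(assignment))
        if total < best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Made-up costs: 2 ground-truth pedestrians (rows), 3 predictions (columns).
toy_cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.3, 0.7],
]
assignment, total = match_predictions(toy_cost)
print(assignment, round(total, 3))
```

In this toy case each ground-truth box receives exactly one prediction, and the leftover prediction is unmatched. In DETR-style training, unmatched predictions are supervised as "no object", which is what removes the need for duplicate suppression.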
Keywords/Search Tags:Pedestrian Detection, Vision Transformer, Transformer Encoder, Transformer Decoder, Robustness