Pedestrian re-identification (Re-ID) originated as a research and application direction in computer vision. With the development of deep learning and the emergence of large-scale datasets, the related techniques have matured and are now widely used in domestic security surveillance. As society raises more diverse demands for public-security prevention and control, and as timeliness requirements keep increasing, single-modal pedestrian Re-ID can no longer cover some specific scenarios. In application scenarios where no reference image is available, pedestrian Re-ID based on natural language descriptions can largely replace single-modal retrieval and has great practical value, so this topic has become increasingly popular in recent years. The task aims to retrieve the pedestrian images that best match a natural language description from an image database or from videos; it is a fine-grained task within cross-modal image-text retrieval.

Different from single-modal tasks, cross-modal tasks must not only improve the feature extraction capability of each modality but also address the fusion and alignment of the two modalities' features. Because person re-identification is finer-grained than traditional image-text retrieval, the information carried by the image and text modalities must be more detailed in order to distinguish between people, so further strengthening the feature extraction capability of both modalities helps improve the final performance. In addition, the visual modality contains much richer information than the textual modality: in most cases, removing some regions of an image does not harm the integrity of the overall information, whereas the same is not true for text. This granularity gap between visual and textual features complicates the alignment and matching of the two modalities.

To address these problems, this paper designs an end-to-end pedestrian Re-ID model based on natural language descriptions: (1) feature extraction for text and vision adopts the classic dual-branch architecture, uses attention mechanisms for feature extraction, and employs the encoders of the pre-trained models BERT and ViT as feature extractors; (2) the feature extraction ability of both modalities is improved by self-supervised training with masked feature reconstruction; (3) a codebook is used to discretize the visual feature vectors, and a decoder reconstructs the original features, training the visual feature extractor to produce coarser-grained features; (4) the features of the two modalities are aligned with two loss functions, CMPM and CMPC. The model is evaluated on the public CUHK-PEDES dataset and compared with baseline models; the results demonstrate its effectiveness and strong competitiveness.
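As a minimal sketch of the dual-branch extraction in point (1), the two pre-trained encoders can be loaded from the Hugging Face `transformers` library; the checkpoint names, caption length, and the choice of the [CLS] token as the global feature are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

# Text branch: pre-trained BERT encoder (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Visual branch: pre-trained ViT encoder (checkpoint name is an assumption).
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def encode_text(caption: str) -> torch.Tensor:
    tokens = tokenizer(caption, return_tensors="pt", truncation=True, max_length=64)
    out = text_encoder(**tokens)
    # Take the [CLS] token as the global textual feature.
    return out.last_hidden_state[:, 0]          # shape: (1, 768)

def encode_image(image) -> torch.Tensor:
    pixels = processor(images=image, return_tensors="pt")
    out = image_encoder(**pixels)
    # Take the ViT [CLS] token as the global visual feature.
    return out.last_hidden_state[:, 0]          # shape: (1, 768)
```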
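The masked feature reconstruction in point (2) can be sketched as follows: a random subset of token features is replaced by a learned mask embedding, the encoder output at those positions is passed through a small reconstruction head, and an MSE loss pulls it back toward the original features. The mask ratio, head architecture, and loss placement are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MaskedReconstruction(nn.Module):
    def __init__(self, dim: int = 768, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))   # learned mask embedding
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.mask_ratio = mask_ratio

    def forward(self, tokens: torch.Tensor, encoder: nn.Module) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) token features fed to the encoder blocks
        b, n, d = tokens.shape
        mask = torch.rand(b, n, device=tokens.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(b, n, d), tokens
        )
        encoded = encoder(corrupted)                        # (batch, seq_len, dim)
        recon = self.head(encoded)
        # Reconstruction loss is computed only on the masked positions.
        return ((recon - tokens.detach()) ** 2)[mask].mean()
```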
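For the codebook step in point (3), one plausible reading is a VQ-VAE-style vector quantisation: each visual token feature is snapped to its nearest codebook entry, a decoder reconstructs the original continuous feature, and a straight-through estimator lets gradients reach the encoder. The codebook size, decoder shape, and loss weighting below are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualCodebook(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 768, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.beta = beta

    def forward(self, feats: torch.Tensor):
        # feats: (batch, seq_len, dim) continuous visual token features
        flat = feats.reshape(-1, feats.size(-1))                    # (B*N, dim)
        dists = torch.cdist(flat, self.codebook.weight)             # (B*N, num_codes)
        codes = dists.argmin(dim=-1)                                # nearest code index
        quantised = self.codebook(codes).view_as(feats)
        # Straight-through estimator: forward pass uses quantised values,
        # backward pass copies gradients onto the continuous features.
        quantised_st = feats + (quantised - feats).detach()
        recon = self.decoder(quantised_st)
        loss = (
            F.mse_loss(recon, feats.detach())                       # decoder reconstruction
            + F.mse_loss(quantised, feats.detach())                 # codebook update
            + self.beta * F.mse_loss(feats, quantised.detach())     # commitment term
        )
        return quantised_st, loss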
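The CMPM loss in point (4) can be sketched along the lines of the cross-modal projection matching formulation of Zhang and Lu ("Deep Cross-Modal Projection Learning for Image-Text Matching"): each image feature is projected onto the normalised text features, the resulting softmax distribution over the batch is pushed toward the ground-truth matching distribution with a KL divergence, and the symmetric text-to-image term is added. The exact scaling and the companion CMPC classification loss are omitted, so treat this as an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
              labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # image_feats, text_feats: (batch, dim); labels: (batch,) identity ids
    # Ground-truth matching distribution: pairs sharing an identity are positives.
    match = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    q = match / match.sum(dim=1, keepdim=True)

    # Image-to-text direction: project images onto normalised text features.
    text_norm = F.normalize(text_feats, dim=-1)
    p_i2t = F.softmax(image_feats @ text_norm.t(), dim=1)           # (batch, batch)
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()

    # Symmetric text-to-image direction.
    image_norm = F.normalize(image_feats, dim=-1)
    p_t2i = F.softmax(text_feats @ image_norm.t(), dim=1)
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()

    return loss_i2t + loss_t2i
```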