The purpose of pedestrian re-identification is to retrieve a specific pedestrian from surveillance images acquired by non-overlapping cameras. However, most existing models are trained on idealized closed-world datasets and therefore transfer poorly to real-world scenarios. In recent years, researchers have turned their attention to open-world pedestrian re-identification in non-ideal scenarios, the most representative of which is cross-modal pedestrian re-identification, whose goal is to match probe data against target data of a different modality. Depending on the task, cross-modal pedestrian re-identification can be divided into retrieval from visible to infrared data, visible to depth data, visible to sketch data, and visible to text data. Because the imaging mechanisms of cameras in different modalities differ greatly, the modality gap between the captured images is difficult to bridge: features of different identities within the same modality can be more similar than features of the same identity across modalities. The central challenge in current cross-modal pedestrian re-identification work is therefore how to use deep learning networks effectively to overcome the modality gap and to design networks with the highest possible recognition accuracy. This paper decomposes this problem, and analyzes and summarizes the model architectures and theoretical foundations of related networks from a mathematical point of view. Based on the results of this analysis, it identifies a new way of approaching cross-modal pedestrian re-identification and proposes an effective new cross-modal pedestrian re-identification network model:

(1) According to how existing works eliminate modality differences, this paper divides previous cross-modal pedestrian re-identification methods into those based on image style transformation, cross-modal shared feature mapping, and modality-invariant feature assistance, and explains the working principle of each class of model from a mathematical point of view. Comparing these works shows that although deep learning models based on an overall network architecture design can achieve better performance, their black-box learning process makes them less interpretable and less generalizable; conversely, deep learning models grounded in mathematical theory tend to exhibit more robust performance. This paper therefore summarizes the mathematical theories commonly used in the field of pedestrian re-identification and draws on them to determine the direction of the subsequent work.

(2) An "Intermediary-guided Bidirectional Spatial-Temporal Aggregation Network for Video-based Visible-Infrared Person Re-Identification" is proposed to perform cross-modal pedestrian re-identification from infrared to visible data (and from visible to infrared data). The network consists of an "intermediary-guided feature learning module" and a "bidirectional spatial-temporal aggregation module". The former uses the low modality correlation of the intermediary feature map to guide the backbone network to extract modality-robust features, and applies cross-reconstruction to the intermediary feature maps to further eliminate modality information. The latter assists identification by extracting modality-independent temporal cues (e.g., gait) that are relevant only to pedestrian identity; a sketch of how these two modules might be organized is given below.
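The following PyTorch sketch illustrates one plausible organization of the two modules described above. It is an assumption-laden illustration rather than the paper's implementation: the class names, the 1x1-convolution mapping into the intermediary space, the L1 cross-reconstruction loss, and the bidirectional GRU aggregator are all hypothetical choices.

```python
# Hypothetical sketch of the two-module design (not the paper's actual code).
import torch
import torch.nn as nn

class IntermediaryGuidedBranch(nn.Module):
    """Maps visible and infrared feature maps into a shared intermediary
    space and cross-reconstructs them to suppress modality information."""
    def __init__(self, channels: int):
        super().__init__()
        # Assumed: 1x1 convolutions for the intermediary mapping.
        self.to_intermediary = nn.Conv2d(channels, channels, kernel_size=1)
        self.reconstruct = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_vis, feat_ir):
        inter_vis = self.to_intermediary(feat_vis)
        inter_ir = self.to_intermediary(feat_ir)
        # Cross-reconstruction: rebuild each modality's intermediary map
        # from the other's, so the intermediary space carries little
        # modality-specific information (L1 penalty assumed here).
        rec_vis = self.reconstruct(inter_ir)
        rec_ir = self.reconstruct(inter_vis)
        loss_rec = (rec_vis - inter_vis).abs().mean() \
                 + (rec_ir - inter_ir).abs().mean()
        return inter_vis, inter_ir, loss_rec

class BidirectionalTemporalAggregation(nn.Module):
    """Aggregates per-frame features in both temporal directions (here a
    bidirectional GRU) to capture identity-related motion cues such as gait."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):          # (batch, time, dim)
        out, _ = self.rnn(frame_feats)       # (batch, time, dim)
        return out.mean(dim=1)               # temporal average pooling
```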
In addition, this paper adds an "easy-sample-based loss" on the output features of the network to assist the triplet loss; it addresses the sub-modal separation problem ignored in previous works and further improves the cross-modal retrieval capability of the network. Extensive experiments show that the "Intermediary-guided Bidirectional Spatial-Temporal Aggregation Network for Video-based Visible-Infrared Person Re-Identification" outperforms state-of-the-art methods under various settings. Compared with the latest method, the proposed method achieves the best retrieval performance on the two main tasks of the VCM dataset, exceeding the previous best Rank-1/mAP by 1.29%/3.46% in the infrared-to-visible retrieval scenario and by 5.04%/3.27% in the visible-to-infrared retrieval scenario, which further verifies the value of the proposed method.
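To make the loss design concrete, the sketch below combines a standard batch-hard triplet loss with a hypothetical easy-sample term that pulls each anchor toward its closest cross-modality positive, discouraging samples of one identity from splitting into per-modality sub-clusters. The abstract does not give the actual form of the easy-sample-based loss, so `triplet_with_easy_sample_loss`, the `is_infrared` modality indicator, and the equal weighting of the two terms are illustrative assumptions.

```python
# Hedged sketch: batch-hard triplet loss plus an assumed easy-sample term.
import torch
import torch.nn.functional as F

def triplet_with_easy_sample_loss(feats, labels, is_infrared, margin=0.3):
    feats = F.normalize(feats, dim=1)          # L2-normalize embeddings
    d = torch.cdist(feats, feats, p=2)         # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)

    # Batch-hard triplet loss: hardest positive and hardest negative per anchor.
    d_pos = d.masked_fill(~same_id, float('-inf')).max(dim=1).values
    d_neg = d.masked_fill(same_id, float('inf')).min(dim=1).values
    loss_tri = F.relu(d_pos - d_neg + margin).mean()

    # Assumed easy-sample term: pull the closest cross-modality positive
    # toward each anchor, so one identity does not split into a visible
    # cluster and an infrared cluster (sub-modal separation).
    cross_mod = is_infrared.unsqueeze(0) != is_infrared.unsqueeze(1)
    easy_pos = d.masked_fill(~(same_id & cross_mod), float('inf')).min(dim=1).values
    loss_easy = easy_pos[torch.isfinite(easy_pos)].mean()

    return loss_tri + loss_easy
```

For both terms to have valid positives, the batch sampler would need to place several visible and infrared tracklets of each identity in every mini-batch, as is common in cross-modal re-identification training.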