Person re-identification (re-ID) aims to correctly match pedestrians of interest in a large corpus of images captured by multiple cameras. It has extensive applications in intelligent security, intelligent transportation, and intelligent policing. In recent years, the rapid development of deep learning and the improvement of computing devices have led to remarkable progress in person re-ID. However, the surveillance scenes involved are complex and diverse: occlusion, blur, and changes in background and clothing pose great challenges to existing methods. To better cope with complex environments and improve the robustness and discrimination capability of person re-ID systems, this thesis focuses on multi-cue information fusion, including spatiotemporal information fusion, foreground and background information fusion, and visual and wireless information fusion. The research contents and contributions of this thesis mainly comprise the following four aspects.

First, for the fusion of temporal and spatial information, this thesis proposes a refining recurrent unit (RRU) that fuses spatiotemporal information to refine frame features and improve feature quality. In person re-ID, features extracted from some frames may be polluted by visual noise such as occlusion and blur. However, the information redundancy between video frames and the motion information contained in a video make it possible to restore the content of one frame by referring to its adjacent frames. To model these properties, the RRU refines the features of each frame using the appearance changes and motion information of pedestrians across frames, thereby reducing the effect of noise. A model equipped with the proposed RRU achieves leading performance on existing datasets.
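To make the idea concrete (an illustrative sketch, not the thesis implementation), a refining unit of this kind could be written as follows in PyTorch; the gating scheme, class name, and feature dimensions are hypothetical:

    import torch
    import torch.nn as nn

    class RefiningRecurrentUnit(nn.Module):
        # Hypothetical sketch: each frame feature is corrected using the
        # recurrent state accumulated from preceding frames, so a frame
        # corrupted by occlusion or blur can borrow from its neighbors.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

        def forward(self, frame_feats):  # frame_feats: (T, B, dim)
            state = frame_feats[0]
            refined = []
            for x in frame_feats:
                ctx = torch.cat([x, state], dim=-1)
                g = self.gate(ctx)               # how much to trust the current frame
                cand = self.update(ctx)          # correction proposed by temporal context
                x_ref = g * x + (1 - g) * cand   # noisy frames lean on temporal context
                state = x_ref
                refined.append(x_ref)
            return torch.stack(refined)          # refined sequence, same shape as input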
Second, for the fusion of foreground and background information, this thesis designs a siamese foreground and background fusion learning framework to improve the model's ability to distinguish foreground from background. Surveillance scenes are complex and changeable, which makes it difficult for a model to separate pedestrians from their surroundings and interferes with the extraction of pedestrian features. To this end, the proposed framework exploits the duality of pedestrian identity and camera identity to guide two branches to extract foreground and background features, respectively. Based on the complementarity of the regions attended to by the two branches, a target enhancement module is proposed so that the branches interact to constrain and reinforce each other. Extensive experimental and visualization results show that the proposed method effectively distinguishes foreground from background and achieves leading performance on multiple datasets.

Third, for the fusion of visual and wireless information, this thesis proposes a multimodal person re-ID framework based on context propagation, which improves the robustness and performance of a person re-ID system by combining visual information with the wireless positioning information of mobile phones. Visual data easily become unreliable under visual noise such as occlusion and clothing changes, whereas the wireless positioning signals of mobile phones are robust to such noise. Accordingly, this thesis formulates a new task that combines person re-ID and signal matching over visual and wireless positioning data, and designs a multimodal person re-ID framework under full-scene labeling. The framework relies on a recurrent context propagation module to fuse visual and wireless information, and trains a person re-ID model on multimodal data with the help of an unsupervised multimodal cross-domain method. By integrating the respective advantages of the two modalities, it improves the reliability and performance of the system and outperforms existing methods on multimodal datasets by a significant margin.

Finally, for the fusion of visual and wireless information, this thesis further proposes a multimodal person re-ID framework based on a graph neural network, which significantly reduces data annotation overhead while achieving performance comparable to methods based on full-scene labeling. Full-scene labeling associates multimodal data by annotating the latitude and longitude of the entire surveillance scene; this improves the reliability of data association but introduces substantial annotation overhead. To this end, this thesis proposes a new framework under weak scene labeling, which only requires the locations of the cameras to be labeled. It uses a multimodal data association strategy to associate visual and wireless data and integrates multimodal information through a multimodal graph neural network. The method outperforms existing visual methods on multiple datasets while significantly reducing annotation overhead, achieving accuracy comparable to that of full-scene labeling methods.
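As an illustration of how a graph layer could integrate the two modalities (a minimal sketch assuming PyTorch; the layer structure and the assoc weights, meant to reflect how strongly two images are linked through the same wireless signal, are hypothetical rather than the thesis design):

    import torch
    import torch.nn as nn

    class MultimodalGraphLayer(nn.Module):
        # Hypothetical sketch of one message-passing step: nodes are person
        # images; assoc[i, j] is a nonnegative weight derived from the
        # wireless-based data association between images i and j.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, feats, assoc):  # feats: (N, dim); assoc: (N, N)
            adj = assoc / assoc.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize
            msg = adj @ self.proj(feats)    # aggregate wireless-associated neighbors
            return torch.relu(feats + msg)  # residual fusion of visual and wireless cues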