With the development of society and the growth of the population, citizens' demand for security continues to increase. The large number of surveillance devices deployed to meet this demand generates enormous data-processing requirements, and video surveillance systems therefore play an important role. Person re-identification is a key technology in such systems, helping to improve both safety and system performance, and it is becoming more widely used. According to the type of object to be recognized, person re-identification can be divided into image-based and video-based person re-identification. Video-based person re-identification can capture dynamic features of a person, such as gait and walking speed, whereas image-based methods focus on static features such as body shape and clothing. In real scenarios, video-based person re-identification faces problems such as temporal and spatial misalignment, changes in person posture, occlusion, and clothing similarity. In view of these problems, this paper conducts research on three aspects: feature alignment, temporal information modeling, and spatial feature extraction. The main research contents are as follows:

(1) To address the misalignment of acquired images caused by differences in camera configurations and viewing angles in real scenes, this paper proposes a feature alignment method based on YOLOv4 and graph matching. First, a re-Detect and Crop (DC) module is proposed, which applies the deep-learning-based detector YOLOv4 to re-detect and crop each trajectory, effectively reducing the interference caused by inaccurate person detection results. Second, a Video Frame-Level Alignment (VFLA) module is proposed. This module uses graph matching to locate, for each point, the corresponding position in the adjacent frame according to the cross-frame pixel similarity between the two frames, thereby achieving feature-map registration. Compared with the baseline network on the MARS dataset, the proposed method improves mAP by 1.6% and Rank-1 by 1.5%, which verifies the effectiveness of the proposed feature alignment method and lays a foundation for the accurate extraction of temporal person features.

(2) To address interference factors such as occlusion and complex backgrounds in some video frames, and the need to effectively obtain and exploit the temporal information in a sequence, this paper proposes a 3D convolution and Self-Attention Mechanism Fusion (3DSAMF) network, which combines short-term 3D convolution with a self-attention module to extract temporal features and obtain a complete temporal representation of the video. 3D convolution has one more (temporal) dimension than 2D convolution and can capture effective short-term features, while the self-attention module processes the entire sequence and captures the relationships between different parts of the input sequence from a global perspective, extracting long-term features. This paper also proposes a hybrid pooling module, which not only preserves person detail information and enhances the most prominent features, but also retains the global information of the image. By combining the advantages of 3D convolution and the self-attention mechanism, the network improves person identification accuracy in large-scale complex scenes. On the MARS dataset, mAP and Rank-1 reach 85.2% and 89.2%, respectively, improvements of 4.0% and 2.6% over the baseline model. Comparisons with other temporal models demonstrate the effectiveness of the network, which improves the recognition of persons in complex scenes.

(3) To address the large amount of redundant information in consecutive frames, which makes it difficult to distinguish persons with similar appearance, a Multi-Grained Temporal Complementary Feature Aggregation (MGTCFA) network is proposed. The network is built on a two-branch architecture consisting of a global branch and a local branch, and it extracts multi-scale features. A Multi-Stage Fusion Module (MSFM) proposed in this paper fuses the features of different branches and different stages in the global branch to obtain multi-stage features containing both temporal and semantic information. The local branch extracts fine-grained local features, and the features of the two branches are fused to obtain a more comprehensive person representation. The proposed method is evaluated on three common datasets, MARS, DukeMTMC-VideoReID and iLIDS-VID, and extensive experiments show that the MGTCFA network performs well. On the iLIDS-VID dataset, Rank-1 and Rank-5 reach 93.3% and 99.3%, respectively, achieving state-of-the-art accuracy.
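As an illustration of the cross-frame alignment idea in (1) — matching each feature location to its most similar position in the adjacent frame — the following sketch uses a greedy cosine-similarity argmax. This is an assumption made to keep the example short; the actual VFLA module performs graph matching over these similarities, and the function names here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align(frame_a, frame_b):
    """Map each location of frame_a to the index of the most similar
    location of frame_b. Each frame is a list of per-location feature
    vectors; a greedy per-point argmax replaces full graph matching."""
    return [max(range(len(frame_b)),
                key=lambda j: cosine(fa, frame_b[j]))
            for fa in frame_a]

# Two toy frames with 3 feature locations each: location 0 of frame_a
# best matches location 1 of frame_b, and so on.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[0.0, 1.0], [1.0, 0.1], [0.9, 1.1]]
print(align(a, b))  # [1, 0, 2]
```

A full graph-matching formulation would additionally enforce one-to-one correspondences and spatial smoothness between neighboring points, which the greedy argmax ignores.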
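The long-term modeling step in (2) — letting every frame feature attend to all frames of the sequence — can be sketched as plain scaled dot-product self-attention. For brevity this sketch assumes Q = K = V = the frame features themselves, with no learned projections or multiple heads; that simplification, and the function names, are not taken from the thesis.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(frames):
    """frames: list of T feature vectors of dimension d.
    Each output vector is a similarity-weighted mix of all frames,
    so every time step sees the whole sequence (long-term context)."""
    d = len(frames[0])
    out = []
    for q in frames:
        # Scaled dot-product scores of this frame against every frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        weights = softmax(scores)
        # Convex combination of all frame features.
        out.append([sum(w * v[i] for w, v in zip(weights, frames))
                    for i in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(seq)
print(len(attended), len(attended[0]))  # 3 2
```

Because each output is a convex combination of the inputs, occluded or noisy frames can be down-weighted in favor of frames that agree with the rest of the sequence, which is the intuition behind using attention alongside short-term 3D convolution.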
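The hybrid pooling module in (2) combines max pooling (which keeps the most prominent response per channel) with average pooling (which retains global sequence information). A minimal sketch of that combination over the temporal dimension follows; the equal-weight sum of the two pooled vectors and the function names are assumptions for illustration, not the thesis implementation.

```python
def max_pool(features):
    """Element-wise max over the temporal dimension: keeps the most
    prominent response of each feature channel."""
    return [max(channel) for channel in zip(*features)]

def avg_pool(features):
    """Element-wise mean over the temporal dimension: keeps the global
    (averaged) information of the whole sequence."""
    return [sum(channel) / len(channel) for channel in zip(*features)]

def hybrid_pool(features):
    """Combine both pooled vectors; an unweighted sum is assumed here."""
    return [m + a for m, a in zip(max_pool(features), avg_pool(features))]

# Example: 3 frames, 2 feature channels.
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
print(hybrid_pool(frames))  # [5.0, 6.0]
```

Max pooling alone would discard how consistently a feature fires across frames, while average pooling alone would dilute a briefly visible but discriminative detail; summing the two keeps both signals.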