| Multi-object tracking plays an important role in visual tasks and appears throughout daily life, for example in video surveillance and autonomous driving. Mainstream online multi-object tracking methods currently fall into two categories: two-stage tracking-by-detection networks and one-stage joint detection and tracking networks. A tracking-by-detection network first extracts target locations with a detection model and then extracts association features at those locations with a separate feature extraction network. Targets are matched across frames using these association features to produce the final tracks. A joint detection and tracking network extracts both target locations and association features from a single shared backbone and then performs tracking with a matching algorithm. This paper studies the one-stage joint detection and tracking paradigm. The main research contents are as follows: 1. A multi-object tracking model with feature decoupling and jointly optimized branches is proposed. It addresses the optimization conflict in the existing joint detection and tracking paradigm, in which the final feature map produced by the backbone is shared by the detection and feature extraction subtasks. The feature extraction branch mainly focuses on enlarging the intra-class variance, while the detection branch focuses on enlarging the inter-class difference and minimizing the intra-class variance. Using the same feature map for both tasks inevitably hinders the learning of task-specific representations and thus degrades tracking performance. This chapter decouples the two branches so that each can be optimized independently. By preserving the shared information needed between branches and combining it with the independent information generated
by the specific branch, an enhanced embedding representation is generated for each branch, thereby decoupling the shared feature map. As a result, the learning ability of the network is enhanced and tracking performance is improved. 2. A multi-object tracking model based on feature purification and trajectory filling is proposed. It addresses the interference with discriminative feature extraction that arises because existing convolutional networks aggregate all surrounding information without filtering out useless background information. Under the current joint detection and tracking paradigm, the identity features of an object are obtained by continuously aggregating surrounding information through a convolutional network. As the receptive field grows with network depth, more and more background information is captured. Such background information can interfere with the discovery of discriminative features of the target. This chapter performs a feature purification operation that uses a limited number of generalized features learned from the original dataset to guide the extraction of discriminative features for each object, ultimately enabling longer-range tracking. In addition, to handle trajectories that are interrupted by occlusion, a Kalman-filter-based method is used to model the previous motion of the target. When the target is lost, its detection position is recovered by this method, which fills in the trajectory online. 3. A multi-object tracking model based on the fusion and enhancement of inter-frame motion information is proposed. It addresses a limitation of the existing joint detection and tracking paradigm, which focuses on single-frame input and ignores the temporal information generated by inter-frame motion. This chapter introduces inter-frame pixel-level motion information to enhance the generation of object features while making full use of spatial information. The information
about the inter-frame difference of the feature map is extracted from both spatial and channel perspectives, and the corresponding original feature map is supplemented with motion information at the spatial and channel levels. In addition, an acceleration strategy that retains the inference results for the previous frame's feature map is introduced, which greatly reduces the inference time of the model and meets real-time requirements. |
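
The online trajectory filling described in the second contribution can be illustrated with a minimal sketch. It is not the model proposed in the paper: it assumes a constant-velocity motion model on the box center, one frame per time step, and hand-picked noise values. The names `Kalman1D` and `fill_track` are hypothetical. Each coordinate is filtered independently, and whenever a detection is missing (`None`), the filter's prediction is used in its place.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (time step = 1 frame)."""

    def __init__(self, z0: float, r: float = 1.0, q: float = 0.01, v_var: float = 100.0):
        self.p, self.v = z0, 0.0   # position and velocity estimates
        self.r, self.q = r, q      # measurement / process noise (assumed values)
        # 2x2 covariance: position as uncertain as one measurement, velocity unknown.
        self.cov = [[r, 0.0], [0.0, v_var]]

    def predict(self) -> float:
        (a, b), (c, d) = self.cov
        self.p += self.v           # x <- F x with F = [[1, 1], [0, 1]]
        # P <- F P F^T + Q with Q = diag(q, q)
        self.cov = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def update(self, z: float) -> None:
        (a, b), (c, d) = self.cov
        s = a + self.r             # innovation variance, H = [1, 0]
        k0, k1 = a / s, c / s      # Kalman gain
        y = z - self.p             # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        self.cov = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def fill_track(detections: List[Optional[Point]]) -> List[Optional[Point]]:
    """Replace missing detections (None) with Kalman predictions, frame by frame."""
    filters = None
    filled: List[Optional[Point]] = []
    for det in detections:
        if filters is None:
            # The track starts at the first real detection; earlier gaps stay None.
            if det is not None:
                filters = [Kalman1D(det[0]), Kalman1D(det[1])]
            filled.append(det)
            continue
        pred = tuple(f.predict() for f in filters)
        if det is None:
            filled.append(pred)    # target lost: fill with the predicted position
        else:
            for f, z in zip(filters, det):
                f.update(z)        # target visible: correct the motion model
            filled.append(det)
    return filled


if __name__ == "__main__":
    track = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), None, None, (7, 14)]
    for frame, pos in enumerate(fill_track(track)):
        print(frame, pos)
```

In this sketch the two occluded frames are filled with positions extrapolated from the motion observed before the occlusion, and the filter is corrected again as soon as a detection reappears. A full tracker would also track box size and would need a rule for dropping a track after too many missed frames.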