| Visual object tracking aims to detect, extract, identify, and track moving targets in image sequences so as to obtain the position and scale of the targets in each frame. It is one of the research hotspots in computer vision and has a wide range of applications in video surveillance, autonomous driving, and machine perception. Single-modality object tracking, an important topic within visual object tracking, has developed rapidly in recent years. Classical machine learning algorithms have been introduced into the field, and on this basis a variety of single-modality tracking algorithms built on different theoretical frameworks have been proposed, with researchers working to improve both their accuracy and their speed. Although many breakthroughs have been made in single-modality tracking, under harsh or extreme conditions, such as poor lighting (low light, weak light, or strong light) or adverse weather (rain or fog), the contour and texture of the tracked target are not well presented in the image, and tracking performance degrades sharply. To address this problem, a thermal infrared modality has been introduced alongside the visible light modality. Combining thermal infrared and visible light information allows tracking to be carried out even under bad or extreme conditions, with high accuracy and in real time. Infrared information compensates for unclear target outlines in visible light images; however, when the radiation temperatures of the tracked target and the background are similar, the infrared image exhibits thermal crossover, which makes it extremely difficult to distinguish the target from the background. Visible light images, with their rich texture details, can make up for the lack of texture information in infrared images, so the combination of visible 
light and infrared modalities can effectively improve tracking performance under complex environmental conditions. This paper studies RGBT object tracking algorithms based on multi-modal feature fusion. The main contributions are as follows: (1) A multi-level fusion RGBT object tracking algorithm based on a Siamese network (Multi-modal Multi-level Fusion Object Tracking Based on Siamese Networks, Siam MMF) is proposed. Fully fusing visible light and infrared information to obtain a more robust feature representation is a core problem in RGBT object tracking. Most current deep-learning-based RGBT tracking methods use only a single pixel-level or feature-level fusion scheme to fuse multi-modal information, ignoring the performance gains that combining multiple fusion schemes can bring. Siam MMF combines the advantages of pixel-level and feature-level fusion: it can exploit finer image details such as edges, texture, and color, while also improving the quality of the fused data and reducing noise interference. If modality reliability is not taken into account, directly fusing the information of the two modalities may introduce too much noise, which hinders the acquisition of robust multi-modal features. To address this, a proportional allocation of modality weights is established, and extensive experiments compare the strengths and weaknesses of adaptive and fixed modality weight calculation methods. In addition, RGBT video sequence pairs with different attributes (such as low illumination, occlusion, and scale variation) are selected to verify the improvement in Siam MMF tracking performance in complex environments. (2) An RGBT object tracking algorithm based on cross-channel local response normalization is proposed. How to obtain robust multi-modal fusion features has always been the 
research focus of RGBT object tracking. The algorithm uses an improved VGG-M network as the backbone of the overall architecture to obtain richer semantic information and more robust features, and a Dropout layer is added to the backbone to avoid overfitting. In the local response normalization part, 3D average pooling is applied across channels to the feature data produced by convolution, in order to extract temporal information between adjacent video frames. Inspired by the observation that implicit attribute information can improve the discriminative power of a model, the algorithm divides the tracking challenges into three typical attributes according to the appearance changes of the target in the tracking scene: extreme illumination, occlusion, and thermal crossover. The training set is extended accordingly with data exhibiting these attributes. In addition, an attribute-driven residual branch is designed for each attribute to mine attribute-specific information and thereby build a powerful residual representation of the tracked target. For model optimization, a binary classification loss and an instance embedding loss are introduced, and the two losses are weighted to achieve overall optimization. Extensive experiments are carried out on the RGBT234 dataset, with comparisons against several state-of-the-art trackers. The experimental results verify the effectiveness of the proposed method. |
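
The cross-channel local response normalization mentioned in contribution (2) can be illustrated with a minimal NumPy sketch. It follows the common formulation in which each activation is divided by a power of the mean squared activation over a window of neighbouring channels, which is equivalent to average pooling along the channel axis. The function name `cross_channel_lrn`, the parameter defaults (`size`, `alpha`, `beta`, `k`), and the zero padding at the channel borders are assumptions made for illustration; the abstract does not specify the thesis's actual settings, and this is not presented as the author's implementation.

```python
import numpy as np


def cross_channel_lrn(x, size=5, alpha=1e-4, beta=0.75, k=1.0):
    """Cross-channel local response normalization (illustrative sketch).

    x has shape (C, H, W). Each activation is divided by
    (k + alpha * mean of squared activations over `size` neighbouring
    channels) ** beta. All defaults are assumed values, not taken from
    the source.
    """
    channels = x.shape[0]
    squared = x ** 2

    # Zero-pad along the channel axis so every channel has a full window.
    pad_front = size // 2
    pad_back = (size - 1) // 2
    padded = np.pad(squared, ((pad_front, pad_back), (0, 0), (0, 0)))

    # Average pooling over a window of `size` channels with a 1x1 spatial
    # extent: stack the shifted views and take their mean.
    windows = np.stack([padded[i:i + channels] for i in range(size)], axis=0)
    mean_squared = windows.mean(axis=0)

    return x / (k + alpha * mean_squared) ** beta


# Usage: normalize a random feature map with 8 channels.
features = np.random.default_rng(0).standard_normal((8, 4, 4))
normalized = cross_channel_lrn(features)
```

Padded channels count towards the window mean here, so border channels are normalized slightly less strongly than interior ones; a variant that averages only over valid channels would be an equally reasonable choice.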