| With advances in three-dimensional (3D) data acquisition technology, scene perception in artificial intelligence is gradually expanding from two-dimensional to three-dimensional space. As a key component of scene perception, stable and accurate 3D object tracking has become a research focus in recent years. Point clouds are currently the usual data source for 3D object tracking. However, owing to the characteristics of point clouds themselves, tracking that relies on point clouds alone has many defects; using images and point clouds together through multimodal perception can improve tracking performance. In 3D object tracking based on multimodal perception, tracking quality depends on two things: the reliability of the object information extracted from the image and the point cloud, and the effectiveness of the information fusion method. This thesis studies multimodal 3D object tracking algorithms: it first studies how to improve the efficiency of unimodal object information extraction, and then how to integrate images with point clouds. The main research contents are as follows: (1) A novel 3D object tracking network is constructed. First, pyramid feature fusion is used to improve the point cloud feature extraction module so that it obtains multi-scale target features. Second, a feature fusion module based on the attention mechanism is designed to extract the feature points of the template and the search area. Then, multi-resolution clustering is applied to the voting points generated by center voting on the fused feature points. Finally, appropriate loss functions are set to constrain the network. (2) The motion information in 3D object tracking is analyzed. In 3D object tracking, the state and motion information of the object help distinguish the tracked object from the background, so extracting the object's motion information and feeding it into the tracker helps track the object's state. This thesis adopts two methods to extract motion information: a sequential embedding module, which uses the motion over the previous several frames to estimate the motion in the current frame, and an interframe information extraction module, which aggregates local motion information from interframe data to estimate the overall motion trend. Both are applied to 3D object tracking. Experimental results show that both modules extract target motion information effectively and improve the performance of 3D object tracking. (3) Three methods of fusing the image and the point cloud (data fusion, feature fusion and target fusion) are used for multimodal 3D object tracking. In data fusion, two methods are proposed: directly fusing the image data with the point cloud data to generate a colored point cloud as input, and fusing the semantic segmentation of the image data with the point cloud data to generate a semantic point cloud as input. In feature fusion, a 3D object tracking network based on feature fusion of the image and the point cloud is designed. In target fusion, the target mask generated by semantic segmentation is used to correct the 3D object tracking box. Comparison experiments verify the effect of multimodal fusion on the performance of 3D object tracking. This thesis enriches the research in the field of 3D object tracking and contributes to the realization of intelligent scene perception... |
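
The data fusion step described in (3), in which image data is attached to point cloud data to produce a colored point cloud, can be illustrated with a minimal sketch. The abstract does not say how the thesis performs this projection, so everything here is an assumption made for illustration: a simple pinhole camera model, points already expressed in the camera frame (x right, y down, z forward), intrinsics named `fx`, `fy`, `cx`, `cy`, and a hypothetical function name `colorize_points`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[int, int, int]
ColoredPoint = Tuple[float, float, float, int, int, int]


def colorize_points(
    points_cam: Sequence[Point],
    image: Sequence[Sequence[Pixel]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[ColoredPoint]:
    """Attach a pixel value to every 3D point that projects inside the image.

    Assumptions (not taken from the thesis): pinhole model without lens
    distortion, points given in the camera frame, image stored row-major
    so that image[v][u] is the pixel at column u and row v.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    colored: List[ColoredPoint] = []
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point lies behind the camera, so it has no pixel
        # Perspective projection onto the image plane.
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            r, g, b = image[v][u]
            colored.append((x, y, z, r, g, b))
    return colored
```

Under the same assumptions, passing a per-pixel class label map in place of the RGB image would yield points tagged with semantic labels rather than colors, which corresponds loosely to the semantic point cloud input mentioned above. Points that fall outside the image or behind the camera are dropped in this sketch; a real pipeline might instead keep them with a default value.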