| With the development of computer technology, computer vision is increasingly used in fields such as video surveillance and video understanding. To extract useful information from videos, it is important to segment them automatically, without manual labor. Since humans are the key subjects in most videos, it is necessary to segment every person in them. Human video instance segmentation, which combines detection, segmentation, and tracking for instances of the person category, is therefore of great significance for video analysis. The task also poses many difficulties, such as the reappearance of a target after long-term disappearance, target occlusion, motion deformation, and changes in light and shadow. Motivated by the above analysis, this paper proposes a video instance segmentation algorithm based on the single-stage instance segmentation method SOLO. The main contributions of the proposed work can be summarized as follows: (1) Because datasets dedicated to humans are currently lacking, the PVIS dataset is built for evaluation by screening and cropping multiple existing video datasets and newly collected data. The results on the new dataset demonstrate the robustness of the proposed algorithm in complex video scenes. (2) Following the detection-and-tracking paradigm, an appearance feature extraction module and a data association module are added to the instance segmentation method SOLO, so that the framework can detect, track, and segment humans simultaneously. By extracting appearance features from feature layers of different scales in the backbone network, the proposed framework can handle identity switches that occur when the size of the same instance changes; by sampling points determined by the maximum contour centroid sampling strategy, the human occlusion problem can be addressed, which enhances segmentation accuracy; by adopting a data association strategy, the same instance is matched across frames so that the humans in the video can be tracked smoothly. (3) To address long-term tracking of instances that are heavily occluded or that reappear, a space-time memory module based on the mask propagation paradigm and the SOLO grid structure is proposed. By storing information from some of the previous frames, it reduces the accumulated errors in the mask propagation process and extends tracking to the entire video sequence. In addition, a selection module is introduced to integrate the results of the two paradigms, yielding higher-quality segmentation results over the video sequence and greatly improving segmentation accuracy. |