With the advance of deep learning in computer vision, deep learning-based video object segmentation methods have developed at an unprecedented pace. As one of the important tasks in computer vision, video object segmentation aims to annotate each frame of a video sequence at the pixel level, assigning each pixel to its corresponding object category and thereby accurately extracting and tracking each object in the video. The difficulty of the task lies in changes in object appearance, similarity between objects, occlusion, and object motion. At the same time, a method must balance real-time performance against accuracy to achieve efficient segmentation. Existing video object segmentation methods still suffer from slow processing speed and insufficient accuracy, so designing precise and efficient methods is of significant value. Starting from these segmentation challenges, this dissertation analyzes and studies deep learning-based video object segmentation methods.

(1) A video object segmentation method based on a U-shaped network architecture is proposed to address the limitations of the One-Shot Video Object Segmentation (OSVOS) algorithm, which struggles with scenes in which object appearance changes or similar objects are present. The method uses attention mechanisms to establish correlations between feature maps, strengthening the global semantic information available to the model. During training, the imbalance between positive and negative samples can lead to inaccurate predictions; this issue is resolved by optimizing the loss function. Because of the correlation between pixels, segmentation results often have rough edges. To address this, the dissertation applies a fully connected conditional random field to post-process the multi-scale prediction results, which effectively improves the accuracy of boundary segmentation.

(2) The Separable Structure Modeling for Semi-Supervised Video Object Segmentation (SSMVOS) method has weak modeling capabilities and cannot effectively segment occluded and fast-moving objects. To address this issue, this dissertation proposes a video object segmentation method based on a hybrid encoder that combines Convolutional Neural Networks (CNN) and Transformer. The proposed method couples the global convolutional module with the Transformer, which not only alleviates the loss caused by low resolution but also better models long-term dependencies and global information in the sequence. In addition, because boundaries in low-resolution images are usually blurry, this dissertation proposes an attention feature fusion boundary refinement module to locate boundaries accurately. The proposed method combines the advantages of Transformer and CNN and makes significant progress on segmentation problems such as occlusion and fast movement.
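
The abstract does not state how the loss function is optimized to handle imbalanced positive and negative samples. As an illustration only, one common remedy in binary segmentation is a class-balanced binary cross-entropy, in which each class is weighted by the pixel frequency of the other class so that the rarer foreground is not overwhelmed by the background. The sketch below assumes this formulation; the function name `balanced_bce` and the NumPy implementation are hypothetical and are not taken from the dissertation.

```python
import numpy as np


def balanced_bce(prob, target, eps=1e-7):
    """Class-balanced binary cross-entropy for one mask (illustrative sketch).

    prob   : predicted foreground probabilities in [0, 1], any shape
    target : ground-truth mask of the same shape, with values 0 or 1
    """
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    prob = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)

    # beta is the fraction of background pixels; it weights the foreground term,
    # while (1 - beta), the fraction of foreground pixels, weights the background term.
    beta = 1.0 - target.mean()

    pos_term = -beta * target * np.log(prob)
    neg_term = -(1.0 - beta) * (1.0 - target) * np.log(1.0 - prob)
    return float((pos_term + neg_term).sum())
```

With this weighting, a mask that contains one foreground pixel and three background pixels gives both classes the same total weight, so an uninformative prediction of 0.5 everywhere is penalized equally for its foreground and background errors.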