Video object segmentation is an important task in computer vision that enables continuous tracking and segmentation of arbitrary objects in videos. It has wide applications in areas such as video editing, autonomous driving, video conferencing, and video surveillance. It can also serve as a fundamental technology for other video understanding tasks such as video summarization and action recognition, and therefore has significant research value. Compared with static image segmentation, video object segmentation can use annotations in the initial frame to specify the target to be predicted. Additionally, because of the extra time dimension, the variation and correlation of the target between video frames must be considered. However, interference from the background and from similar objects in a video frame calls for more discriminative target representations in the spatial domain. Rapid motion and morphological changes of the target across video frames challenge the robustness of temporal information aggregation. Moreover, the complexity and coupling of video contexts require the model to have more effective spatio-temporal feature encoding and decoding capabilities. To address these challenges, this thesis conducts targeted research at three levels: target feature extraction in the image space, temporal information aggregation over multiple frames, and video object segmentation by spatio-temporal feature fusion.

1. Target feature extraction in the image space

Extracting key target information from complex image content and constructing diverse feature memories are important foundations for accurate video object segmentation. In this thesis, target feature learning is studied under semi-supervised and unsupervised settings. Under the semi-supervised setting, the pixel-level annotation of the first frame provides target information, but using only pixel-level features lacks the ability to discriminate the target
as a whole and is easily disturbed by local background and similar targets. To this end, this thesis proposes a prototype-based method for multi-level target feature learning, which generates prototypes to represent different targets and different parts of a target. A hierarchical target memory is constructed from pixel-level, part-level, and instance-level features. Prototype-to-prototype, prototype-to-pixel, and pixel-to-pixel matching are then built to achieve target predictions with better discrimination and multi-scale adaptability. In the unsupervised setting, no first-frame annotation is available, so global structural prior information is needed to enhance the feature representation of the target. Considering that edges are blurred in local areas while the overall structure of the target is relatively stable, this thesis proposes a segmentation framework based on semantic embedding and a morphological prior. The semantic embedding module combines high-level structural information with low-level texture details. Training with signed distance field labels enhances the model's perception of target edges. These designs effectively improve the morphological plausibility of the segmentation results.

2. Temporal feature aggregation from multiple frames

More robust target representations can be achieved by gathering information from multiple reference frames. However, the scarcity of prior annotations and complex scene changes make it difficult to aggregate inter-frame information. In weakly-supervised video object segmentation in particular, only a bounding box annotation is provided in the initial frame, which makes wrong object matching more likely. To this end, this thesis proposes a two-stage target feature re-aggregation framework. The first stage performs coarse matching to obtain the target prior of the query frame and to align target information between frames. The second stage then performs refined matching
and aggregation. Furthermore, the traditional unidirectional aggregation from the reference frame to the query frame lacks verification of target information. To address this problem, a bidirectional temporal aggregation mechanism is designed to achieve interactive verification of inter-frame information at both the pixel and channel levels and to strengthen the focus on target regions that co-occur between frames. To ensure the effectiveness of information aggregation, this thesis further proposes a cross-task distillation strategy from a semi-supervised teacher model to a weakly-supervised student model. This strategy provides reliable supervision for temporal feature learning under the weakly-supervised setting, so that inter-frame information is fully aggregated and more robust segmentation results are obtained.

3. Spatio-temporal feature encoding and decoding

The visual appearance and motion correlation properties of video objects are diverse and coupled with each other, making it difficult to fully represent and use video information from the temporal or spatial dimension alone. This thesis studies the encoding and decoding of spatio-temporal target information. For encoding, an algorithm is first designed on the mitotic cell detection task in medical imaging and then extended to video object segmentation to verify its feasibility. Traditional single-stage end-to-end training lacks effective optimization of the encoding of spatial appearance and temporal motion. This thesis proposes a pre-training strategy based on auxiliary tasks that provides explicit supervision for the appearance and motion encoding networks, giving the model a stronger ability to perceive spatio-temporal target information. For spatio-temporal decoding, existing methods adopt a decoding network with fixed parameters, which lacks specific adaptability to different targets. This thesis proposes to predict conditional convolution parameters based on
spatio-temporal target features. A global-local dynamic guidance module and a dynamic semantic embedding module are designed to make full use of the spatio-temporal information and to guide the processing of target features. Together they form a multi-stage dynamic decoding process that varies with the input target, achieving efficient and accurate predictions for video object segmentation.

Through these three parts, this thesis analyzes the technical pipeline of video object segmentation and proposes solutions from different aspects to tackle the difficulties of the task and the problems of existing methods. The work helps to improve the discriminability of target features, the robustness of temporal aggregation, and the sufficiency of spatio-temporal information utilization, making some progress in both theory and application.
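
As a small illustration of the signed distance field labels mentioned in part 1, the sketch below converts a binary target mask into a signed distance map. It is not the thesis's implementation: the function name `signed_distance_field`, the brute-force search, and the sign convention (positive inside the target, negative outside) are all assumptions made for this example.

```python
import math


def signed_distance_field(mask):
    """Brute-force signed distance field for a small binary mask.

    mask: list of rows containing 0 (background) or 1 (target).
    Assumed convention: each pixel stores the Euclidean distance to the
    nearest pixel of the opposite class, positive inside the target and
    negative outside it.
    """
    height, width = len(mask), len(mask[0])
    target = [(y, x) for y in range(height) for x in range(width) if mask[y][x]]
    background = [(y, x) for y in range(height) for x in range(width) if not mask[y][x]]
    if not target or not background:
        raise ValueError("mask must contain both target and background pixels")

    sdf = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = bool(mask[y][x])
            # Search the opposite class for the closest pixel.
            opposite = background if inside else target
            dist = min(math.hypot(y - oy, x - ox) for oy, ox in opposite)
            row.append(dist if inside else -dist)
        sdf.append(row)
    return sdf


# Hypothetical 5x5 mask with a 3x3 target in the centre.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance_field(mask)
for row in sdf:
    print(" ".join(f"{value:6.2f}" for value in row))
```

Unlike a binary label, the regression target changes smoothly and crosses zero at the target boundary, so a model trained on it receives an error signal that depends on how far a prediction lies from the edge. The brute-force search is only practical for tiny masks; a real pipeline would normally use a dedicated distance transform.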