Video Object Segmentation (VOS) is a challenging computer vision task; good results require fully exploiting both the spatial information within a single frame and the information between frames. When the reference objects are supplied by visual saliency detection and its segmentation result, an unsupervised video object segmentation model can be built. Once the objects to segment are determined, two main factors govern VOS performance: the quality of reference-frame detection in complicated scenes, and the pixel-level algorithm that matches the reference frame against the current frame. Deep learning algorithms have been applied successfully to Salient Object Detection (SOD) and VOS, with self-attention models in particularly wide use. Nonetheless, detecting the reference frame in VOS remains difficult because salient objects can be small, semantic information can be blurred, and contrast with the background can be low. Research on VOS algorithms inspired by visual saliency is therefore of both theoretical and practical significance. This paper studies saliency-inspired VOS in depth, building on CNN-based algorithms. The core idea is to build a VOS network model that exploits the strengths of the self-attention model and the multi-scale feature fusion model in order to improve reference-frame detection in VOS tasks, in terms of both the accuracy and the efficiency of the network. The work of this paper can be summarized as follows:

(1) Developing an SOD network model that integrates the self-attention mechanism. The cascade decoding network combines a feature extraction network, a feature fusion network, and a feature enhancement network, and works by transforming the feature map into low-level features of spatial information and high-level features of spatial information. The feature fusion network, based on the self-attention mechanism, fuses the two sets of features and forms self-dependencies among them. The complementary features are enhanced by multi-scale features, as reflected in the enlarged receptive field. Furthermore, the network obtains the feature map through a cascade decoder together with a purpose-designed loss function. Experiments on five datasets show that the proposed algorithm obtains better detection results than other popular SOD methods.

(2) Addressing the problem of finding the reference frame in complicated scenes during VOS tasks, and the problem of matching the reference frame with the current frame while maintaining relatively high efficiency, the paper presents a saliency-guided solution: saliency detection of the foreground target provides the segmentation mask of the first frame, which is integrated with the matching algorithm to extract pixel-level similarity information. Moreover, the model analyzes the overall information via the Transformer mechanism to provide global guidance for the whole model. Experiments on real datasets show that this algorithm effectively improves the matching process and the VOS results.
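
As a rough illustration of pixel-level matching between a reference frame and the current frame, the following minimal sketch labels each current-frame pixel as foreground when its best cosine-similarity match among the reference foreground pixels beats its best match among the reference background pixels. This is a generic nearest-match scheme written under stated assumptions, not the network proposed in the paper: the function names `cosine` and `match_pixels`, the toy 2-channel features, and the decision rule are all hypothetical, and in practice the features would come from a CNN backbone and the reference mask from saliency detection.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_pixels(ref_feats, ref_mask, cur_feats):
    """Label each current-frame pixel by its best match in the reference frame.

    ref_feats, cur_feats: lists of per-pixel feature vectors (flattened H*W).
    ref_mask: list of bools, True where the reference pixel is foreground.
    Assumes the reference mask contains at least one foreground pixel and
    at least one background pixel.
    """
    fg = [f for f, m in zip(ref_feats, ref_mask) if m]
    bg = [f for f, m in zip(ref_feats, ref_mask) if not m]
    cur_mask = []
    for feat in cur_feats:
        fg_score = max(cosine(feat, r) for r in fg)  # best foreground match
        bg_score = max(cosine(feat, r) for r in bg)  # best background match
        cur_mask.append(fg_score > bg_score)
    return cur_mask


# Toy 2x2 frames, flattened row by row, with hypothetical 2-channel features.
ref_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
ref_mask = [True, False, False, False]  # stands in for a saliency-derived mask
cur_feats = [[0.1, 0.9], [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]  # object has moved

print(match_pixels(ref_feats, ref_mask, cur_feats))
# [False, False, False, True]
```

The comparison of every current pixel with every reference pixel costs on the order of (H*W)^2 similarity evaluations, which is one reason efficiency matters for the matching step.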