Research On Unsupervised Video Object Segmentation Algorithm Based On Fusion Of Motion And Appearance Information

Posted on: 2024-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Wang
GTID: 2568307106975999
Subject: Electronic information
Abstract/Summary:
Video object segmentation aims to separate the primary object of interest from the background in a video and track it over time. Unsupervised video object segmentation aims to locate and accurately segment foreground objects without requiring a manually annotated ground-truth segmentation mask for the first frame at test time. Because it needs no additional manual annotation at test time, unsupervised video object segmentation is more flexible and more widely applicable, and it has attracted widespread attention in fields such as video understanding, autonomous driving, and video editing. This thesis studies unsupervised video object segmentation algorithms based on the fusion of motion and appearance information and makes the following contributions:

(1) An unsupervised video object segmentation algorithm based on the fusion of discrete cosine transform (DCT) frequency-domain features. To address the insufficient interaction between appearance and motion information in existing dual-stream networks, a lightweight unsupervised video object segmentation network based on DCT feature fusion is proposed. Exploiting the fact that different frequency components of a feature differ in importance, a DCT feature fusion module is designed. First, the features of the two modalities are cross-fused in the spatial domain so that they complement each other. Then the fused features are transformed to the frequency domain by the DCT, where importance weights for their frequency components are learned and used to enhance the features. To address the high computational complexity caused by large receptive fields in existing algorithms, a large-kernel convolution context semantic guidance module is designed. It decomposes a large-receptive-field convolution into a spatial local convolution, a spatial long-range convolution, and a channel convolution, which reduces computational complexity while retaining the contextual semantic learning ability of a large receptive field. In addition, global semantic information guides the upsampling process in the decoding stage, gradually aggregating the frequency-enhanced multi-level features to obtain more accurate segmentation results. Extensive experiments on the DAVIS2016, FBMS, and DAVSOD datasets verify the effectiveness of this method.

(2) An unsupervised video object segmentation algorithm based on quality estimation of motion information. To make better use of motion information in the dual-stream network, this work builds on the first algorithm and further considers how motion information of different quality affects the interaction between motion and appearance information. Low-quality motion information not only fails to supplement appearance information and improve segmentation performance, but also contaminates the appearance features. Existing methods do not explicitly model how the quality of motion information contributes to the fused features of the dual-stream network. To adjust how strongly motion information of different quality participates in fusion with appearance information, a motion information quality estimation module is designed that explicitly models the importance weight of motion information at each encoding stage. Also building on the first algorithm, a frequency-domain global filtering module is constructed. Based on the principle that multiplication in the frequency domain is equivalent to convolution in the spatial domain, the module learns global dependencies among features in the frequency domain through learnable weights, replacing the self-attention module and reducing computational complexity. Experiments on the DAVIS2016, FBMS, and DAVSOD datasets show that the proposed method outperforms existing state-of-the-art methods.
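
The principle behind the frequency-domain global filtering module in (2) can be illustrated with a small NumPy sketch. It is not the thesis implementation, which the abstract does not specify. The function names `global_filter` and `circular_conv2d`, the (C, H, W) tensor shape, and the use of the 2-D FFT are all assumptions made for illustration. In a trained network the frequency-domain weight would be a learnable parameter. Here it is derived from random spatial kernels only so that the result can be checked against an explicit circular convolution.

```python
import numpy as np


def global_filter(x, weight):
    """Filter a (C, H, W) feature map by element-wise multiplication
    with a complex (C, H, W) weight in the 2-D FFT domain."""
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    filtered = spectrum * weight  # one weight per channel and frequency
    return np.fft.ifft2(filtered, axes=(-2, -1)).real


def circular_conv2d(x, kernel):
    """Reference: explicit circular convolution of an (H, W) map
    with an (H, W) kernel, i.e. a kernel as large as the map itself."""
    h, w = x.shape
    out = np.zeros((h, w))
    for dy in range(h):
        for dx in range(w):
            # np.roll(x, (dy, dx))[i, j] equals x[(i - dy) % h, (j - dx) % w]
            out += kernel[dy, dx] * np.roll(x, shift=(dy, dx), axis=(0, 1))
    return out


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8))        # 2 channels, 8 x 8 spatial grid
kernels = rng.standard_normal((2, 8, 8))  # one global kernel per channel

# Stand-in for the learnable frequency-domain weight.
weight = np.fft.fft2(kernels, axes=(-2, -1))

y_freq = global_filter(x, weight)
y_spatial = np.stack([circular_conv2d(x[c], kernels[c]) for c in range(2)])

# Both routes give the same result up to floating-point error.
max_abs_diff = np.max(np.abs(y_freq - y_spatial))
```

In this sketch every output position depends on every input position, which is the global dependency the abstract refers to. The FFT route costs O(HW log HW) per channel, whereas self-attention over the same HW positions scales as O((HW)^2).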
Keywords/Search Tags: unsupervised video object segmentation, discrete cosine transform, attention mechanism, frequency domain analysis