Deep convolutional networks and fully convolutional networks have made significant progress in semantic segmentation, the task of assigning a class to each pixel in an image. Most existing semantic segmentation models rely heavily on labeled data, but annotating segmentation masks is far more expensive than labeling classification data. Few-shot semantic segmentation therefore aims to infer the semantic categories of objects in query images from only a few support images and their corresponding segmentation masks. However, making a model converge under few-shot data remains a major challenge. To align the spatial information of support and query objects, parameter structure-based methods attempt to find similarity relationships between support and query objects, while prototype-based methods compute the center of each category and assign each pixel to the nearest center. These methods nevertheless overlook some issues. For example, the spatial variation between the support and query sets may exceed the range their similarity measures can handle, causing the model to make inaccurate inferences. In complex scenes, the spatial inconsistency between support and query sets may also involve the relative positions and poses of multiple objects.

The following challenges therefore remain: (1) In few-shot semantic segmentation, attention is given primarily to the current support class, so new classes may be misclassified as background, causing feature degradation. Moreover, different levels of features require specific handling during feature extraction: intermediate features capture the position and contour of objects but lack fine detail, while high-level features carry rich semantic information but may lose edge detail. (2) Because few-shot data cannot accurately approximate class statistics, the prototypes estimated from the support set are unreliable, and prototype learning inevitably discards spatial information, leading to prototype bias. (3) Extracting information only from foreground support pixels for query segmentation ignores all background support pixels, even though the background may contain base classes closely related to the query objects; ignoring these background pixels introduces prior bias.

To address these issues, this study proposes the TSNet* (Taking Spatial Network*) model, which comprises a feature generation process, an SC (Strip Channel) module, and an AT (ASPP Transformer) module; the TSNet model further incorporates a base learner and an ensemble module. TSNet* and TSNet aim to solve the following problems: (1) To counter feature degradation and the loss of beneficial support information, TSNet* learns meta-knowledge within a generalized few-shot semantic segmentation framework. It extracts embedding information from both ground-truth masks and pseudo masks, allowing new classes hidden in the training data to be explored. During feature generation, features at different levels are processed in different ways, and feature fusion leverages the intermediate and high-level features of the backbone network to improve the model's generalization performance. (2) To address the loss of spatial semantic information in prototype learning, the SC module, composed of strip pooling and channel attention, is proposed to process the intermediate features of the backbone network. Furthermore, the TSNet model considers background objects that the prototype vectors may overlook: the base learner uses additional background samples to train a base classifier that captures background information more effectively, and the ensemble module combines the outputs of the base learner with the prototype vectors to reduce the impact of prototype bias on the final results. (3) To mitigate the prior bias introduced by the prior masks, TSNet* first applies outlier processing to improve the quality of the prior-mask features, which helps guide query-image segmentation and improves few-shot segmentation performance. In addition, the AT (ASPP Transformer) module concatenates multi-level dilated spatial pyramid pooling with a Transformer, effectively reducing the influence of prior bias and enhancing the segmentation results for query images. Experimental results demonstrate that the proposed TSNet* and TSNet models achieve state-of-the-art performance on few-shot image semantic segmentation tasks.
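As context for the prototype-based methods discussed above, the following minimal PyTorch sketch shows the usual baseline: a class prototype is obtained by masked average pooling over support features, and each query pixel is labeled by cosine similarity to the nearest prototype. All function names and shapes here are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Average support features under a binary mask to get a class prototype.

    feat: (C, H, W) support feature map; mask: (H, W) binary foreground mask.
    Returns a (C,) prototype vector.
    """
    mask = mask.unsqueeze(0).float()                              # (1, H, W)
    return (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1e-6)

def prototype_predict(query_feat, fg_proto, bg_proto):
    """Label each query pixel by cosine similarity to the nearest prototype."""
    protos = torch.stack([bg_proto, fg_proto])                    # (2, C)
    q = F.normalize(query_feat.flatten(1), dim=0)                 # (C, H*W)
    p = F.normalize(protos, dim=1)                                # (2, C)
    sim = p @ q                                                   # (2, H*W)
    return sim.argmax(dim=0).view(query_feat.shape[1:])           # (H, W) labels
```

Because the prototype is a single pooled vector per class, all spatial layout inside the mask is discarded, which is exactly the source of the prototype bias that the SC module and the base learner are designed to compensate for.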
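The SC module is described only as a combination of strip pooling and channel attention. A hypothetical sketch of such a block is given below, assuming standard strip pooling (averaging over one spatial axis, then convolving along the other) and squeeze-and-excitation-style channel attention; the exact kernel sizes and reduction ratio are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class StripChannel(nn.Module):
    """Hypothetical SC-style block: strip pooling captures long-range
    horizontal/vertical context; channel attention then reweights channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        n, c, h, w = x.shape
        # Strip pooling: average over one spatial axis, convolve along the other.
        sp_h = self.conv_h(x.mean(dim=3, keepdim=True))           # (n, c, h, 1)
        sp_w = self.conv_w(x.mean(dim=2, keepdim=True))           # (n, c, 1, w)
        x = x * torch.sigmoid(sp_h + sp_w)                        # broadcast gate
        # Squeeze-and-excitation-style channel attention.
        scale = self.fc(x.mean(dim=(2, 3))).view(n, c, 1, 1)
        return x * scale
```

Strip pooling is a natural fit here because its elongated pooling windows preserve positional structure along each axis, which is the spatial information that plain global average pooling into a prototype would destroy.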
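Similarly, the AT module is described as concatenating multi-level dilated spatial pyramid pooling (ASPP) with a Transformer. A speculative sketch under common design choices follows; the dilation rates, head count, and single-layer depth are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ASPPTransformer(nn.Module):
    """Illustrative AT-style head: parallel dilated convolutions (ASPP) gather
    multi-scale context; a Transformer encoder layer then models global
    pixel-to-pixel relations over the fused feature map."""
    def __init__(self, channels, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.proj = nn.Conv2d(channels * len(rates), channels, 1)
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, dim_feedforward=2 * channels, batch_first=True)

    def forward(self, x):
        n, c, h, w = x.shape
        # ASPP: fuse the parallel dilated branches with a 1x1 projection.
        x = self.proj(torch.cat([b(x) for b in self.branches], dim=1))
        # Flatten pixels to tokens for global self-attention, then restore shape.
        tokens = self.attn(x.flatten(2).transpose(1, 2))          # (n, h*w, c)
        return tokens.transpose(1, 2).view(n, c, h, w)
```

The two stages are complementary: the dilated branches enlarge the receptive field at fixed cost, while the attention layer lets every query pixel attend to every other, which is how such a head can dilute a locally wrong prior mask.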