Semantic segmentation, a fundamental task in computer vision, underpins many applications such as autonomous-driving scene analysis and medical lesion image analysis. Training fully supervised semantic segmentation models requires a large number of annotated images, yet pixel-level annotation is time-consuming. To address this, semi-supervised semantic segmentation has become an active research area: it trains neural networks with a limited number of annotated samples, a large pool of unlabeled images, and pseudo-label generation, thereby reducing annotation cost. Mainstream deep learning approaches extract image features and classify pixels with convolutional neural networks, but recent studies have shown that Vision Transformer-based segmentation methods outperform convolutional ones. Meanwhile, existing semi-supervised methods based on pixel-level contrastive learning suffer from high computational cost and difficult sampling, which limit the effectiveness of pixel-level classification. To address these problems, this paper proposes a Transformer-based, mask-level semi-supervised semantic segmentation method. The main contributions are as follows:

(1) To address the large training-data requirement of Transformer-based architectures, this paper designs a pre-training method for the Transformer-based segmentation decoder. Semantic segmentation annotations are generated in a self-supervised fashion from the attention matrices of Vision Transformer backbone networks on large-scale datasets, and these generated annotations are used to pre-train the decoder.

(2) To address the problems of pixel-level contrastive learning in semi-supervised semantic segmentation, this
paper designs a mask-based contrastive learning method comprising a Mask Contrastive (MC) loss and a Mask Feature Contrastive (MFC) loss. The MC loss first divides the segmentation annotations into per-category masks and then performs contrastive learning among masks of different categories, which eliminates the false negative samples that afflict pixel-level contrastive learning. The MFC loss uses the masks to obtain a feature representation for each category and performs contrastive learning between these category representations, avoiding the extra storage needed to maintain category features and solving the sampling difficulty.

(3) This paper designs an Ensemble Mask Consistency (EMC) loss. Unlike the masks derived from segmentation labels, the masks predicted by the network may include multiple masks for the same category; this paper therefore merges predicted masks that represent the same category and enforces consistency regularization against the segmentation annotations.

(4) This paper conducts experiments on two widely used datasets, Pascal VOC and Cityscapes. Experimental results show that both the proposed self-supervised pre-training method and the mask-level semi-supervised segmentation method outperform current state-of-the-art algorithms on these public datasets, and extensive ablation studies confirm the effectiveness of the proposed components.
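The mask-pooling idea behind the MFC loss can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the function names, the InfoNCE-style loss form, and the temperature value are illustrative choices, not details taken from the paper. Per-category prototypes are obtained by average-pooling features inside each category mask, and prototypes of the same category from two views attract while different categories repel.

```python
import numpy as np

def mask_pooled_prototypes(features, labels, num_classes):
    """Average-pool features inside each category mask.

    features: (H, W, C) feature map
    labels:   (H, W) integer category map
    Returns {class_id: (C,) prototype} for the classes present.
    """
    protos = {}
    for k in range(num_classes):
        m = labels == k
        if m.any():
            protos[k] = features[m].mean(axis=0)
    return protos

def mask_feature_contrastive(protos_a, protos_b, temperature=0.1):
    """InfoNCE-style loss between category prototypes of two views:
    a class's prototype from view A should match the same class's
    prototype from view B and repel other classes' prototypes."""
    shared = sorted(set(protos_a) & set(protos_b))
    a = np.stack([protos_a[k] for k in shared]).astype(float)
    b = np.stack([protos_b[k] for k in shared]).astype(float)
    a /= np.linalg.norm(a, axis=1, keepdims=True)   # cosine normalization
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # (K, K) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # diagonal = positives
```

Because prototypes are pooled once per category rather than sampled per pixel, the number of contrastive pairs is bounded by the number of classes, which is the property the paper credits for avoiding the sampling and memory problems of pixel-level contrastive learning.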