Semantic segmentation is a fundamental computer vision task and a prerequisite for scene understanding and object recognition. It is of great significance to fields such as medical image analysis, autonomous driving, and security monitoring. However, collecting pixel-level labels is time-consuming and expensive, which limits the application of semantic segmentation. To address this, weakly supervised semantic segmentation (WSSS) has been proposed, which uses weak supervision to generate pixel-level pseudo-labels for segmentation tasks. Two-stage WSSS based on image-level classification labels focuses on generating accurate pseudo-labels from class activation maps (CAMs). This approach has three problems: 1) the classification network locates only the most discriminative object regions, so the resulting coarse pseudo-labels cannot adequately supervise semantic segmentation; 2) classification networks lack the equivariance that segmentation tasks require, which affects model robustness; 3) CAMs contain both over-activated and under-activated regions. This paper studies these problems as follows.

For the first problem, this paper proposes a multiscale feature fusion network (MFFN), which captures context information at different scales to expand the CAMs. Specifically, the network uses dilated convolutions with different receptive fields to extract features in parallel. It then averages the multiscale CAMs generated by the different dilated convolutions and adds the averaged map to the initial CAM to highlight the object regions. The mIoU of the resulting pseudo-labels is 4.01% higher than that obtained with ResNet50.

For the second problem, based on the idea of SSENet, this paper uses affine equivariant regularization to guide the whole network to learn more accurate CAMs. Specifically, MFFN serves as the backbone of a Siamese framework. The distance between the CAMs of the original image and those of the affine-transformed image is used as an optimization objective that guides the whole network to approximate affine equivariance. The mIoU of the pseudo-labels is 0.83% higher than that of the multiscale feature fusion network alone.

For the third problem, this paper proposes a class activation map fine-tuning module, which refines the expanded CAMs by pixel-level correction. Specifically, the module builds long-range dependencies over shallow features using cross-attention, which captures contextual appearance information for each pixel, and revises the CAMs with the learned affinity attention maps. The mIoU of the final pseudo-labels increases by 6.08%.

Experiments on the PASCAL VOC 2012 dataset show that the proposed weakly supervised semantic segmentation method achieves advanced results.
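
The fusion step of the first contribution and the regularization term of the second can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes CAMs are plain 2D grids of scores, uses a horizontal flip as a stand-in for the affine transformation, and takes the mean absolute difference as the distance, since the abstract does not specify which distance is used. The function names `fuse_multiscale_cams`, `hflip`, and `equivariance_loss` are hypothetical.

```python
from typing import List

Cam = List[List[float]]  # one class activation map, stored as rows of scores


def fuse_multiscale_cams(initial_cam: Cam, branch_cams: List[Cam]) -> Cam:
    """Average the CAMs of the dilated branches and add the mean to the initial CAM."""
    if not branch_cams:
        raise ValueError("need at least one multiscale CAM")
    n = len(branch_cams)
    fused = []
    for r, row in enumerate(initial_cam):
        fused_row = []
        for c, value in enumerate(row):
            # Mean activation at this pixel over all dilation branches.
            mean_activation = sum(cam[r][c] for cam in branch_cams) / n
            fused_row.append(value + mean_activation)
        fused.append(fused_row)
    return fused


def hflip(cam: Cam) -> Cam:
    """Horizontal flip (assumption: a simple example of an affine transform)."""
    return [list(reversed(row)) for row in cam]


def equivariance_loss(cam_of_transformed: Cam, cam_of_original: Cam) -> float:
    """Mean absolute distance between CAM(T(x)) and T(CAM(x)), with T = hflip.

    Applying T to the original CAM before comparing is an assumption made so
    that the two maps are spatially aligned.
    """
    target = hflip(cam_of_original)
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(cam_of_transformed, target)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    initial = [[0.0, 1.0], [0.0, 0.0]]
    branches = [
        [[0.0, 1.0], [1.0, 0.0]],  # e.g. CAM from a small dilation rate
        [[0.0, 1.0], [0.0, 0.0]],  # e.g. CAM from a larger dilation rate
    ]
    fused = fuse_multiscale_cams(initial, branches)
    print(fused)
    # A perfectly equivariant network would give zero loss.
    print(equivariance_loss(hflip(fused), fused))
```

In this toy example the fused map gains a response at a pixel that only one dilated branch activates, which is the intended expansion effect. The loss is zero only when the CAM of the transformed image equals the transformed CAM of the original image.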