With the development of technologies such as artificial intelligence and computer hardware, the field of computer vision increasingly employs deep learning to address its problems. Image semantic segmentation, a fundamental task in computer vision, is applied in many areas such as intelligent transportation and medical diagnosis, so research on it is of great importance. Indoor scenes are structurally complex, and RGB images provide only color information, which blurs the boundaries between objects of similar color. Depth images supply the corresponding geometric relationships for RGB images and preserve the spatial information of objects, so combining the two can effectively improve segmentation. Semantic segmentation based on complementary RGB and depth images has therefore gradually become a popular research direction in image processing.

The focus of this paper is semantic segmentation of indoor scenes using RGB-D images. The main research components are as follows:

(1) The relevant theory of deep learning applied to RGB-D image segmentation is studied, the existing problems in multi-modal fusion and multi-scale fusion for RGB-D semantic segmentation are analyzed, and research is carried out on these problems.

(2) To account for the differences and complementarity between RGB and depth images, an attention-guided multi-modal cross-fusion segmentation network (ACFNet) is proposed to integrate the two modalities effectively. The network adopts an encoder-decoder structure with an asymmetric dual-stream feature extraction network, and a global-local feature extraction module (GL) is added to the RGB encoder. To fuse RGB and depth features effectively, an attention-guided multi-modal cross-fusion module (ACFM) is proposed so that the enhanced fused representations can be exploited at multiple stages (a schematic sketch of this fusion step is given below). Experiments show that ACFNet significantly improves the segmentation of indoor scenes.

(3) To handle the varying sizes of target objects in indoor scenes, an RGB-D semantic segmentation network (EMFNet) is proposed that fuses the multi-scale features extracted by the encoder. A pooling multi-scale fusion module (PMFM) is proposed to exploit the multi-scale features obtained at the encoder stage, and a multiple skip connection module (MSCM) is designed to reuse the details lost during down-sampling (a sketch of the pooling-based multi-scale fusion also follows below). Experiments show that EMFNet outperforms ACFNet and the other semantic segmentation methods compared.
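As an illustration of the cross-modal fusion idea in item (2), the following PyTorch sketch shows one way attention-guided fusion of RGB and depth feature maps can be written. The module name, the channel-attention form, and the cross-weighting scheme are assumptions for illustration only; the abstract does not specify the internals of the ACFM.

```python
# Minimal sketch (not the thesis implementation) of attention-guided
# fusion of RGB and depth features, in the spirit of the ACFM.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuse RGB and depth features of matching shape (assumed design).

    Each modality produces channel-attention weights (squeeze-and-excitation
    style) that re-weight the other modality before the two are summed.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()

        def channel_gate():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                      # global context per channel
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )

        self.rgb_gate = channel_gate()    # attention derived from RGB features
        self.depth_gate = channel_gate()  # attention derived from depth features

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Cross re-weighting: each modality is modulated by the other's gate,
        # then the refined features are merged and passed on to the decoder.
        rgb_refined = rgb_feat * self.depth_gate(depth_feat)
        depth_refined = depth_feat * self.rgb_gate(rgb_feat)
        return rgb_refined + depth_refined


if __name__ == "__main__":
    fusion = CrossModalAttentionFusion(channels=64)
    rgb = torch.randn(2, 64, 60, 80)    # e.g. one encoder stage of the RGB stream
    depth = torch.randn(2, 64, 60, 80)  # matching depth-stream features
    print(fusion(rgb, depth).shape)     # torch.Size([2, 64, 60, 80])
```

In the abstract's description, a fusion step of this kind would be applied at each of the multiple encoder stages where RGB and depth features are combined.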
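Similarly, for item (3), the sketch below shows a pooling-based multi-scale fusion step in the spirit of the PMFM, written here as PSPNet-style pyramid pooling over the deepest encoder features. The pool sizes, channel counts, and module structure are assumptions; the actual PMFM and MSCM may differ.

```python
# Minimal sketch (assumed design, not the thesis code) of pooling-based
# multi-scale fusion: features are pooled at several scales, re-projected,
# up-sampled, and concatenated so that objects of different sizes are
# represented before decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoolingMultiScaleFusion(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = out_channels // len(pool_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                  # pool to a coarse grid
                nn.Conv2d(in_channels, branch_channels, 1),  # reduce channels
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for size in pool_sizes
        ])
        # Project the concatenation of the input and all pooled branches.
        self.project = nn.Conv2d(
            in_channels + branch_channels * len(pool_sizes), out_channels, 1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        pyramid = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return self.project(torch.cat(pyramid, dim=1))


if __name__ == "__main__":
    pmfm_like = PoolingMultiScaleFusion(in_channels=512, out_channels=256)
    feat = torch.randn(2, 512, 15, 20)  # deepest encoder features
    print(pmfm_like(feat).shape)        # torch.Size([2, 256, 15, 20])
```

The MSCM described in the abstract would complement such a module by carrying early-stage, high-resolution features to the decoder through multiple skip connections, recovering details lost during down-sampling; its exact form is not specified here.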