| Images are one of the most prevalent forms of information. The evolution of digital cameras and recording devices and the rapid development of the internet have led to explosive growth in image data, making it impossible for humans alone to process tens of trillions of images. Image-related computer vision tasks have become increasingly widespread and deeply integrated into daily life, playing a crucial role in areas such as medical imaging, autonomous driving, and video processing. Instance segmentation is a fundamental task in computer vision: given an image, the objective is to separate individual object instances from the background at the pixel level. In instance segmentation, effective features are often difficult to extract because image feature extraction is inefficient and short-range and long-range features are hard to balance. The resulting feature maps are insufficiently informative and cannot fully reflect the diverse information of the different objects in the image. In addition, objects may overlap and occlude one another, which leads to inaccurate boundary segmentation as well as inefficient and imprecise mask generation. This dissertation proposes a new feature extraction network to address these issues and introduces a new mask generation method; both achieve better results on several datasets. The main contributions of this dissertation are as follows. This dissertation proposes a new feature extraction network based on convolutional neural networks and self-attention. The network combines short-range and long-range feature extraction with shifted-window attention by integrating CMT modules and Swin Transformer modules. The local perception unit of the CMT module enhances the extraction of local information, while a lightweight multi-head self-attention module and an inverted residual feed-forward network are used to
improve the efficiency of the network and achieve better results. The Swin Transformer module uses a shifted-window mechanism to overcome the lack of cross-window interaction in window-based multi-head self-attention, thereby enhancing global feature extraction. To evaluate the effectiveness of this method for instance segmentation, experiments were conducted on the COCO and Cityscapes datasets. On COCO, the method outperforms the baseline model on many segmentation metrics, and the segmentation results for large, medium, and small objects all improve by more than 3% over the baseline Mask R-CNN model. The segmentation results also indicate improvements in region-of-interest proposals and in input data validity. On Cityscapes, the segmentation results at different depth levels are significantly better than those of the baseline Mask R-CNN model. These experiments on different datasets validate the effectiveness of the method. To address the low efficiency of mask generation on large-scale object datasets, a mask generation method based on the discrete cosine transform (DCT) is introduced, which combines low complexity with high quality. Experimental results on the large-scale COCO dataset show that the DCTMask mask generation method is more effective than the original mask generation method and achieves better segmentation results across instance sizes. To address the low efficiency of mask generation for large-scale objects, target mask generation based on the DCTMask method is introduced, together with the Swish activation function, to balance low complexity and high quality. Experimental results on the Cityscapes dataset also demonstrate that the proposed method is effective. |
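
The idea behind a DCT-based mask representation can be sketched as follows: a binary mask is transformed with a 2D discrete cosine transform, only the low-frequency coefficients are kept as a compact code, and the mask is recovered by the inverse transform followed by thresholding. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation. The function names (`encode_mask`, `decode_mask`, `iou`), the 8×8 mask size, the `u + v < keep` selection rule, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import math


def dct_basis(n):
    """Orthonormal DCT-II basis B with B[u][x] = a(u) * cos((2x + 1) * u * pi / (2n))."""
    basis = []
    for u in range(n):
        scale = math.sqrt((1.0 if u == 0 else 2.0) / n)
        basis.append([scale * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                      for x in range(n)])
    return basis


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def encode_mask(mask, keep):
    """2D DCT of a square mask; coefficients with u + v >= keep are zeroed.

    The u + v rule is an assumed stand-in for a low-frequency selection scheme.
    """
    n = len(mask)
    b = dct_basis(n)
    coeffs = matmul(matmul(b, mask), transpose(b))  # C = B M B^T
    return [[coeffs[u][v] if u + v < keep else 0.0 for v in range(n)]
            for u in range(n)]


def decode_mask(coeffs, threshold=0.5):
    """Inverse 2D DCT followed by thresholding back to a binary mask."""
    b = dct_basis(len(coeffs))
    recon = matmul(matmul(transpose(b), coeffs), b)  # M = B^T C B
    return [[1 if value >= threshold else 0 for value in row] for row in recon]


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    union = sum(1 for p, t in pairs if p or t)
    inter = sum(1 for p, t in pairs if p and t)
    return inter / union if union else 1.0


# Toy 8x8 mask containing a 4x4 square "instance".
N = 8
mask = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(N)]
        for r in range(N)]

compact = encode_mask(mask, keep=4)   # at most 10 of 64 coefficients survive
approx = decode_mask(compact)
print("IoU with truncated code:", iou(approx, mask))
```

Raising `keep` retains more coefficients and brings the reconstruction closer to the original mask; with `keep = 2 * N - 1` every coefficient is kept and the round trip is exact.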