| Image matting is the process of extracting soft foregrounds from an image. As a basic image processing technology, it has many important applications in image and video editing and synthesis, virtual and augmented reality, and film production. The result of matting is represented by an alpha matte that stores the opacity of the foreground (the value α) at each pixel. Unlike the hard foreground produced by image segmentation, where α is either 0 or 1, α in an alpha matte can take any value between 0 and 1. Image matting is usually an ill-posed problem. Therefore, in addition to the image itself, many matting algorithms require users to supply auxiliary inputs such as a trimap (a three-value semantic map) as additional guidance. To avoid the heavy labor cost of producing auxiliary inputs, much research on automatic image matting has emerged. Most existing automatic matting methods design a network that estimates a semantic trimap and then use the estimated semantic map to guide matting. The application scenarios of matting are becoming increasingly complex, and users must cope with widely varying scenes. When the objects to be matted are simple, users want a more efficient matting process; when the objects are complex, users are willing to sacrifice efficiency for higher accuracy. Unfortunately, current matting networks are not flexible and cannot adapt as the application scenario changes. To address these problems, we make a first attempt to model image matting as a multi-stage task. The multi-stage automatic matting network we propose consists of three stages that apply an attention mechanism. The stages are similar in structure, and each can be combined with the preceding stages to form an independent network. Each independent network can generate an alpha matte estimate on its own, and the accuracy of the result improves gradually with each added stage. In this way, we provide users with a fully automatic
matting network that requires no auxiliary input and lets users flexibly balance accuracy and efficiency. We also find that current matting methods generally define a unified ground-truth semantic image for different types of foreground by applying the same morphological operation to the ground-truth alpha matte. However, because the foregrounds of natural images are complex and non-uniform, a single morphological operation may produce unreasonable semantic images. In addition, to limit memory consumption, fully automatic matting models do not use a large network to estimate the semantic map, which leads to inaccurate estimates that cannot effectively guide the subsequent matting task. In response to these problems, we propose the definition of a generalized binary mask, which divides images into significant and non-significant categories according to the type of foreground, and we label the images in the Adobe dataset accordingly to obtain ground-truth semantic maps. Compared with three-value semantic maps, the generalized binary mask is also an easier target for the network, so the network can estimate the semantic map more accurately. On this basis, we further propose an automatic matting network based on the generalized mask. The network is composed of two subnets. The mask subnet uses a lightweight model to estimate the generalized mask from the input color image; the mask is then fed, together with the original image, to the matting subnet, which is built on an improved Swin Transformer and a progressive fine-tuning module and generates the final alpha matte estimate. Both proposed models outperform state-of-the-art automatic image matting models by a large margin. |
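
As an illustration of the conventional practice the abstract criticizes, the following minimal sketch derives a three-value semantic map (trimap) from a ground-truth alpha matte with a single, fixed morphological operation. It is not the generalized binary mask proposed here. The function names `dilate` and `trimap_from_alpha`, the square structuring element, the `radius` parameter, and the 0/128/255 encoding are all assumptions made for the example.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1) square window, NumPy only."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask, dtype=bool)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def trimap_from_alpha(alpha: np.ndarray, radius: int = 1) -> np.ndarray:
    """Return a trimap: 0 = background, 128 = unknown, 255 = foreground.

    `alpha` is a 2-D float array with values in [0, 1].
    The encoding and the fixed `radius` are assumptions for illustration.
    """
    fg = alpha >= 1.0   # fully opaque pixels
    bg = alpha <= 0.0   # fully transparent pixels
    # Unknown band: every semi-transparent pixel, widened by `radius`.
    unknown = dilate(~(fg | bg), radius)
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[fg & ~unknown] = 255
    trimap[bg & ~unknown] = 0
    return trimap


if __name__ == "__main__":
    alpha = np.array([[0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0]])
    print(trimap_from_alpha(alpha, radius=1))
```

Because the same `radius` is applied to every image, a thin structure such as hair and a large semi-transparent object such as glass receive unknown bands of the same width, which is the kind of mismatch the abstract attributes to unified morphological operations.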