
Fine-grained Image Generation Model Based On Scene Graph

Posted on: 2021-05-05    Degree: Master    Type: Thesis
Country: China    Candidate: F X Xue    Full Text: PDF
GTID: 2428330611999432    Subject: Computer science and technology
Abstract/Summary:
In early research on text-to-image generation, the variational autoencoder (VAE) was the most commonly used method: it encodes the text and then decodes the encoding to generate the corresponding image. However, due to the limitations of the VAE model, the quality of the generated images is poor. Current text-to-image work mainly uses generative adversarial networks (GANs), which address the problem of low image quality: the generator learns the data distribution of the generated images in order to deceive the discriminator, while the discriminator is optimized to distinguish real samples from the fake samples produced by the generator. So far, most text-to-image papers aim only at generating images of a single object, and great progress has been made on single-object generation, but there are few studies on generating images that contain multiple objects. Although scene graphs can be used to generate multiple objects in one image, existing networks do not handle the details of the objects well, and the image-generation process is unstable during training, which degrades image quality.

To address the lack of object detail, this paper proposes adding a self-attention mechanism to the mask regression network. Most text-to-image models use convolutional GANs, and the convolution operation is limited by its local receptive field: if an object occupies a large area of the image, a convolution kernel cannot cover the entire region occupied by the object, so the extracted regions are only weakly related and the overall quality of the generated object suffers. By introducing a self-attention mechanism that connects otherwise independent regions of the feature map, the lack of detail caused by the local receptive field can be effectively alleviated.

To address the instability of text-to-image training, this paper applies a progressive growing strategy to the cascaded refinement network. The image resolution can be increased by adding hidden layers to the cascaded refinement network, but doing so all at once forces the generator to learn too many parameters simultaneously, and the optimization algorithm cannot coordinate multiple layers to capture these interdependent parameter values. Instead, hidden layers are added to the generator and discriminator gradually during training, so the model first generates the contour information of the image and then shifts its attention to filling in the details. This not only stabilizes training but also speeds it up.

Two datasets are used to verify the results: the Visual Genome (VG) dataset and the COCO-Stuff dataset. The VG dataset provides manually annotated scene graphs, while for COCO-Stuff synthetic scene graphs must be constructed from the positional relationships between objects in each image. To verify the effectiveness of the proposed model, the Inception Score is used to evaluate the quality of the generated images, and a human evaluation is used to check whether the generated images are consistent with their scene graphs. Both evaluations show that the proposed method generates images of better quality.
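To make the role of the self-attention mechanism concrete, the following is a minimal sketch of a SAGAN-style self-attention block over a 2D feature map, written in PyTorch. The module name `SelfAttention2d`, the channel-reduction factor, and the learnable gate `gamma` are illustrative assumptions; the thesis's mask regression network may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a 2D feature map.

    Each spatial position attends to every other position, so distant
    regions of the same object can be related despite the limited
    receptive field of ordinary convolutions.
    """

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable gate, initialised to 0 so training starts from plain
        # convolutional behaviour and gradually mixes in attention.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)   # (B, HW, C')
        k = self.key(x).view(b, -1, h * w)                      # (B, C', HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)               # (B, HW, HW)
        v = self.value(x).view(b, c, h * w)                     # (B, C, HW)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x
```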
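The progressive growing strategy can likewise be illustrated with a simplified sketch, again assuming PyTorch. The fade-in coefficient `alpha` and the `grow()` method follow the general progressive-growing idea of blending a newly added higher-resolution block's output with an upsampled output of the previous stage; they are not taken directly from the thesis's cascaded refinement network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GrowingGenerator(nn.Module):
    """Toy generator that doubles output resolution by fading in new blocks.

    During training, `alpha` ramps from 0 to 1: at 0 the output is simply the
    upsampled image of the previous (lower-resolution) stage, at 1 it comes
    entirely from the newly added block.
    """

    def __init__(self, latent_dim=128, channels=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.channels = channels
        # 1x1 latent vector -> 4x4 feature map
        self.base = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, channels, kernel_size=4),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.ModuleList()                              # one per grow()
        self.to_rgb = nn.ModuleList([nn.Conv2d(channels, 3, kernel_size=1)])

    def grow(self):
        """Append a block that doubles the spatial resolution."""
        self.blocks.append(nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(self.channels, self.channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ))
        self.to_rgb.append(nn.Conv2d(self.channels, 3, kernel_size=1))

    def forward(self, z, alpha=1.0):
        x = self.base(z.view(z.size(0), self.latent_dim, 1, 1))
        if not self.blocks:
            return self.to_rgb[0](x)
        for block in self.blocks[:-1]:
            x = block(x)
        new_x = self.blocks[-1](x)
        # Blend the upsampled old output with the new block's output.
        old_rgb = F.interpolate(self.to_rgb[-2](x), scale_factor=2, mode="nearest")
        new_rgb = self.to_rgb[-1](new_x)
        return (1 - alpha) * old_rgb + alpha * new_rgb
```

In such a scheme, `grow()` would be called on both the generator and the discriminator each time the target resolution doubles, with `alpha` scheduled from 0 to 1 over the following iterations so that the new layer is introduced smoothly.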
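For reference, the Inception Score used in the evaluation is typically computed as the exponential of the expected KL divergence between the conditional label distribution p(y|x) and the marginal p(y). The sketch below assumes a recent torchvision with a pretrained Inception-v3 and images already resized to 299x299 and normalized; the number of splits is an illustrative choice, not a detail reported in the abstract.

```python
import torch
import torch.nn.functional as F
from torchvision.models import inception_v3

@torch.no_grad()
def inception_score(images, splits=10, device="cpu"):
    """Rough Inception Score: exp(E_x[KL(p(y|x) || p(y))]).

    `images` is a float tensor of shape (N, 3, 299, 299), preprocessed as
    expected by torchvision's pretrained Inception-v3.
    """
    model = inception_v3(weights="IMAGENET1K_V1").to(device).eval()
    probs = F.softmax(model(images.to(device)), dim=1)        # p(y|x) per image
    scores = []
    for chunk in probs.chunk(splits):
        p_y = chunk.mean(dim=0, keepdim=True)                 # marginal p(y)
        kl = (chunk * (chunk.log() - p_y.log())).sum(dim=1)   # KL per image
        scores.append(kl.mean().exp())
    return torch.stack(scores).mean().item()
```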
Keywords/Search Tags: text-to-image generation, self-attention mechanism, progressive growing, scene graph, GAN