It is easier for people to understand vivid image information than abstract and complex text, because an image highlights the key content more directly. However, it is much harder to obtain images that match a given piece of text. Text-guided image generation combines computer vision and natural language processing and is therefore a cross-disciplinary task: given a text description of details such as the shape and color of an object, the goal is to generate an image with the corresponding semantics. A single description can correspond to many visually different images, so the difficulty lies in generating clear, natural, and diverse images that still match the semantics of the input text. The mainstream approach is to use generative adversarial networks and their variants, typically multi-stage architectures that progressively generate images at increasing resolutions. However, such architectures are unstable and time-consuming to train, require many parameters and much computation, and the generated images often look like a stack of simple patterns that lack detail and realism. Based on the current state of the field and the problems above, the main work of this paper is as follows:

(1) To address the poor visual quality and limited diversity of generated images and their lack of detail, this paper proposes a triple attention-based generative adversarial network (TAGAN). The model uses a single generator-discriminator pair. During upsampling, the generator applies a triple attention mechanism to repeatedly extract text features and refine image detail, and fuses the two kinds of features to produce clear, natural images that are consistent with the text. To help the generator converge, the discriminator uses a one-way output that treats only real and matching image-text pairs as valid, which provides an accurate optimization direction, and applies a matching-aware gradient penalty to improve the agreement between the generated image and the input text (a hedged sketch of such a penalty is given after this summary).

(2) To address the growing complexity of text-to-image models, with their large parameter counts and long training times, this paper proposes a lightweight feature fusion generative adversarial network (LFGAN). During forward propagation, the generator reuses the text information through conditional convolution and dense connectivity, using the text as a condition that adjusts the visual appearance of the generated image. To further improve visual quality, this paper adopts a BERT text encoder and a perceptual loss to strengthen the generator's understanding of the text and the alignment between the two modalities, thereby enriching the detail of the generated image (minimal sketches of text-conditioned convolution and of a perceptual loss also follow after this summary). Because the model uses a simple single-stage structure and supplements the missing information during generation, the number of parameters is greatly reduced while achieving visual quality comparable to that of the comparison models.
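The matching-aware gradient penalty mentioned in (1) is not spelled out in this summary. The following is a minimal sketch, assuming a DF-GAN-style penalty computed on real images paired with their matching text embeddings, with the discriminator producing a single score per image-text pair; the names `discriminator`, `real_images`, and `text_embeddings`, as well as the coefficients `k` and `p`, are illustrative assumptions rather than TAGAN's actual interface.

```python
# Minimal sketch of a matching-aware gradient penalty (assumed DF-GAN-style);
# TAGAN's exact formulation may differ.
import torch


def matching_aware_gradient_penalty(discriminator, real_images, text_embeddings,
                                    k=2.0, p=6.0):
    """Penalize the gradient norm of D at real, matching (image, text) pairs."""
    real_images = real_images.detach().requires_grad_(True)
    text_embeddings = text_embeddings.detach().requires_grad_(True)

    # One-way output: a single validity score per (image, text) pair.
    scores = discriminator(real_images, text_embeddings)

    grad_img, grad_txt = torch.autograd.grad(
        outputs=scores.sum(),
        inputs=(real_images, text_embeddings),
        create_graph=True,
    )
    grad_img = grad_img.reshape(grad_img.size(0), -1)
    grad_txt = grad_txt.reshape(grad_txt.size(0), -1)
    grad_norm = torch.sqrt((grad_img ** 2).sum(dim=1) + (grad_txt ** 2).sum(dim=1))

    # k and p are illustrative defaults, not values taken from the paper.
    return k * (grad_norm ** p).mean()
```

Penalizing the gradient only at real, matching pairs smooths the discriminator's decision surface around the data the generator should imitate, which is the convergence aid described in (1).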
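The summary does not give the exact form of the conditional convolution used by LFGAN in (2). The sketch below shows one common reading, in which the convolution kernel is predicted from the sentence embedding so that the text directly conditions the filtering of image features; the class name, layer sizes, and the per-sample kernel generation are assumptions.

```python
# Minimal sketch of text-conditioned convolution: the kernel is predicted
# from the sentence embedding. Layer sizes and the per-sample kernel
# generation are illustrative assumptions, not LFGAN's actual design.
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, text_dim, kernel_size=3):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        # Predict one convolution kernel per sample from its text embedding.
        self.to_weight = nn.Linear(
            text_dim, out_channels * in_channels * kernel_size * kernel_size)

    def forward(self, x, sentence_embedding):
        b, c, h, w = x.shape
        weight = self.to_weight(sentence_embedding).view(
            b * self.out_channels, self.in_channels,
            self.kernel_size, self.kernel_size)
        # Grouped convolution trick: treat the batch as groups so that each
        # sample is convolved with its own text-predicted kernel.
        x = x.reshape(1, b * c, h, w)
        out = F.conv2d(x, weight, padding=self.kernel_size // 2, groups=b)
        return out.reshape(b, self.out_channels, h, w)
```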
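The perceptual loss in (2) is a standard technique: the generated and target images are compared in the feature space of a pretrained network rather than in pixel space. The sketch below uses an intermediate layer of torchvision's VGG16; the layer choice, distance function, and loss weighting that LFGAN actually uses are not specified here and are assumed.

```python
# Minimal sketch of a VGG-based perceptual loss. The layer choice, distance
# function, and weighting used by LFGAN are assumptions.
import torch.nn as nn
from torchvision import models


class PerceptualLoss(nn.Module):
    def __init__(self, layer_index=16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Frozen feature extractor up to (and including) the chosen layer.
        self.features = nn.Sequential(
            *list(vgg.features.children())[:layer_index + 1]).eval()
        for param in self.features.parameters():
            param.requires_grad = False
        self.criterion = nn.L1Loss()

    def forward(self, generated, target):
        # Compare images in VGG feature space rather than pixel space.
        return self.criterion(self.features(generated), self.features(target))
```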