With the development of deep learning, image synthesis has attracted increasing attention, and its range of applications has grown ever wider. This paper studies text-to-image synthesis in depth, that is, learning how to guide the generation of images from a text description. The generated images are required not only to be realistic and diverse, but also to match the given text description in content. Building on a study of existing text-to-image models, this paper proposes the following two models:

1. TMGAN, a text-guided image manipulation model based on a generative adversarial network. The generator adopts a Transformer encoder-decoder structure to extract global context information, which addresses the problem that generated images were not realistic enough. The discriminator consists of two parts, a Transformer-based multi-scale discriminator and a word-level discriminator, which give the generator more fine-grained feedback so that the generated image both satisfies the text description and preserves the content of the original image that is unrelated to the text. Experiments on the public CUB bird dataset show that the IS (Inception Score), FID (Fréchet Inception Distance), and MP (Manipulation Precision) metrics reach 9.07, 8.64, and 0.081, respectively. The proposed method outperforms state-of-the-art methods: the generated image not only meets the attribute requirements of the given text description but also maintains high semantic consistency with it.

2. TBMGAN, a text-guided image manipulation model based on a multi-stage generative adversarial network. It generates high-quality, high-resolution images step by step through two stages: the first stage improves the quality of image generation by incorporating more perceptual information, and the second stage extracts image features adaptively through a dynamic memory module. The model uses a BERT text encoder to process the text, replaces the convolutional network with a Transformer to capture context information, and integrates a word-level discriminator into the discriminator to give the generator finer-grained feedback. Experiments show that the IS, FID, and MP metrics on the CUB bird dataset reach 9.07, 9.15, and 0.085, respectively, while on the COCO dataset they reach 27.88, 15.21, and 0.072, respectively. The proposed method outperforms state-of-the-art methods: it generates high-quality, semantically consistent images that match the input text description not only on simple datasets but also on the complex scenes of the COCO dataset.
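
The thesis abstract does not include code. As a minimal illustrative sketch of the kind of Transformer encoder-decoder generator described for TMGAN, written in PyTorch: the text is encoded into contextual word features, the source image is split into patch tokens, and every patch attends to the words so that text-relevant regions can be edited while the rest is preserved. All module names, dimensions, and the patch-token layout here are assumptions for illustration, not the thesis's actual implementation.

# Hypothetical sketch of a Transformer encoder-decoder generator for
# text-guided image manipulation (TMGAN-style); all dimensions and the
# patch layout are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerManipulationGenerator(nn.Module):
    def __init__(self, vocab_size=5000, d_model=256, patch=16, img_size=256):
        super().__init__()
        self.patch, self.img_size = patch, img_size
        self.n_patches = (img_size // patch) ** 2
        self.word_emb = nn.Embedding(vocab_size, d_model)       # word tokens
        self.patch_emb = nn.Conv2d(3, d_model, patch, patch)    # patchify the source image
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)  # global context over words
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=4)  # image tokens attend to words
        self.to_rgb = nn.Sequential(nn.Linear(d_model, patch * patch * 3), nn.Tanh())

    def forward(self, image, word_ids):
        # Contextual word features from the text description.
        words = self.encoder(self.word_emb(word_ids))                   # (B, T, d)
        # Patch tokens of the source image; cross-attention edits them.
        tokens = self.patch_emb(image).flatten(2).transpose(1, 2) + self.pos  # (B, N, d)
        edited = self.decoder(tokens, words)                            # (B, N, d)
        rgb = self.to_rgb(edited)                                       # per-patch pixels
        B, g, p = image.size(0), self.img_size // self.patch, self.patch
        rgb = rgb.view(B, g, g, p, p, 3).permute(0, 5, 1, 3, 2, 4)
        return rgb.reshape(B, 3, self.img_size, self.img_size)

# Smoke test with random inputs.
gen = TransformerManipulationGenerator()
out = gen(torch.randn(2, 3, 256, 256), torch.randint(0, 5000, (2, 18)))
print(out.shape)  # torch.Size([2, 3, 256, 256])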
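
The word-level discriminator mentioned for both models can likewise be sketched: instead of a single sentence-level real/fake score, each word attends over image region features and receives its own logit, giving the generator per-word feedback on which attributes the image does or does not reflect. The shapes and region-extraction network below are assumptions, not the thesis's architecture.

# Hypothetical sketch of a word-level discriminator: one real/fake logit
# per word, computed from word-to-region attention. Shapes are assumptions.
import torch
import torch.nn as nn

class WordLevelDiscriminator(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # Downsample the image into a grid of region features.
        self.regions = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, d_model, 4, 2, 1),
        )
        self.score = nn.Linear(d_model, 1)

    def forward(self, image, word_feats):
        # word_feats: (B, T, d) contextual word features from the text encoder.
        r = self.regions(image).flatten(2).transpose(1, 2)            # (B, N, d) regions
        attn = torch.softmax(word_feats @ r.transpose(1, 2), dim=-1)  # word-to-region attention
        attended = attn @ r                                           # (B, T, d) evidence per word
        return self.score(attended).squeeze(-1)                       # (B, T) one logit per word

disc = WordLevelDiscriminator()
logits = disc(torch.randn(2, 3, 256, 256), torch.randn(2, 18, 256))
print(logits.shape)  # torch.Size([2, 18])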
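
Finally, the dynamic memory used in TBMGAN's second refinement stage can be sketched in the spirit of DM-GAN-style memory modules: word features are written into a key-value memory, stage-one image features query it, and a gate decides how much of the memory response to inject back into each region. The gating form and all dimensions below are assumptions for illustration.

# Hypothetical sketch of a dynamic-memory refinement step: words are written
# to memory, image regions read from it, and a gate fuses the response.
import torch
import torch.nn as nn

class DynamicMemory(nn.Module):
    def __init__(self, d_img=256, d_word=256, d_mem=256):
        super().__init__()
        self.write_key = nn.Linear(d_word, d_mem)    # memory keys from words
        self.write_val = nn.Linear(d_word, d_mem)    # memory values from words
        self.query = nn.Linear(d_img, d_mem)         # queries from image regions
        self.gate = nn.Linear(d_img + d_mem, d_img)  # how much memory to inject

    def forward(self, img_feats, word_feats):
        # img_feats: (B, N, d_img) region features from the stage-1 image.
        # word_feats: (B, T, d_word) contextual word features.
        k = self.write_key(word_feats)                       # (B, T, d_mem)
        v = self.write_val(word_feats)                       # (B, T, d_mem)
        q = self.query(img_feats)                            # (B, N, d_mem)
        addr = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # (B, N, T) addressing
        response = addr @ v                                  # (B, N, d_mem)
        g = torch.sigmoid(self.gate(torch.cat([img_feats, response], -1)))
        return g * response + (1 - g) * img_feats            # gated refinement

mem = DynamicMemory()
refined = mem(torch.randn(2, 64, 256), torch.randn(2, 18, 256))
print(refined.shape)  # torch.Size([2, 64, 256])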