Deep learning is a branch of machine learning that simulates the structure and function of the human neural system through deep networks such as Convolutional Neural Networks and Auto-Encoders. The continuous advancement of deep learning and related algorithms has made text-to-image generation one of the research hotspots in Artificial Intelligence and Computer Vision. Text-to-image generation aims to generate a matching image from a textual description through a neural network model, integrating multiple modalities of information such as text, noise, and images. In traditional image generation methods, generated images often suffer from distortion, low resolution, and inconsistency with the text description. This paper proposes two networks based on deep learning: a more effective text encoder is proposed to improve text feature processing, and attention mechanisms and multi-head attention mechanisms are introduced into the networks to improve the quality of the generated images and their consistency with the text.

In Chapter 3, a text-to-image network model based on Conditional Augmentation and attention mechanisms is proposed, consisting of two parts: a text processing network and a Generative Adversarial Network (GAN). A Bidirectional Long Short-Term Memory network is adopted in the text processing network to extract and process text features, and a Conditional Augmentation module is employed to enrich the semantic features and augment the text feature data. In the GAN, text features are fused with visual features, and the fused features are then adjusted along the channel and spatial dimensions by attention mechanisms, so that the generator focuses on the important features of the text description and ultimately produces the generated images. The model is optimized by an adversarial loss that discriminates between generated and real images. Experiments on the MSCOCO and CUB-200 birds datasets evaluate the corresponding metrics, and this method demonstrates significant advantages over other methods.

In Chapter 4, a text-to-image generation model based on a pre-trained CLIP and a Transformer network is proposed. The model likewise consists of a text processing network and a GAN. A pre-trained CLIP model is adopted in the text processing network to extract and process text features. In the generator of the GAN, a multi-layer perceptron implements a non-linear mapping of the features, which are then fed into a Transformer Encoder and an upsampling network for feature extraction. The multi-head attention mechanism in the Transformer Encoder significantly improves the efficiency of processing long text sequences, the establishment of correlations between text and images, and parallel computation. Generated images are reconstructed through a linear unflatten layer. In the discriminator, generated images are fed into a Transformer Encoder and a linear flatten layer for feature extraction and are compared with real images. Adversarial loss functions are designed for each part of the network. Experiments on the Multi-Modal CelebA-HQ and CUB-200 birds datasets evaluate the corresponding metrics and demonstrate the better generation performance of this method.
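To make the Chapter 3 text-processing step concrete, the sketch below shows one common way a Conditional Augmentation module is realized (as in StackGAN-style pipelines): the sentence embedding parameterizes a Gaussian, and reparameterized samples from it serve as an enriched conditioning vector. This is a minimal illustration under assumed dimensions (256-d sentence embedding, 128-d condition), not the thesis implementation.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sketch of Conditional Augmentation: a sentence embedding
    parameterizes a Gaussian, and resampling from it enriches and
    smooths the text conditioning. Dimensions are assumptions."""
    def __init__(self, embed_dim: int = 256, cond_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, cond_dim * 2)  # predicts mu and log-variance

    def forward(self, sent_embed: torch.Tensor):
        mu, logvar = self.fc(sent_embed).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        c = mu + eps * torch.exp(0.5 * logvar)        # reparameterized sample
        return c, mu, logvar                          # mu/logvar feed a KL regularizer
```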
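The channel- and spatial-dimension adjustment described for the Chapter 3 GAN can be sketched as a CBAM-style attention block. Again a minimal illustration under assumed shapes; the simple additive text-visual fusion at the end stands in for whichever fusion scheme the thesis actually uses.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Illustrative channel + spatial attention applied to fused
    text-visual feature maps; layer sizes are assumptions."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, reweight per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: one map from avg- and max-pooled channels.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                  # channel reweighting
        avg = x.mean(dim=1, keepdim=True)            # B x 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)           # B x 1 x H x W
        return x * self.spatial_conv(torch.cat([avg, mx], dim=1))

# Fuse a text embedding with a visual feature map, then attend:
text = torch.randn(4, 128)                           # text condition vector
feat = torch.randn(4, 128, 16, 16)                   # visual feature map
fused = feat + text[:, :, None, None]                # simple additive fusion
out = ChannelSpatialAttention(128)(fused)
```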
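For Chapter 4, a rough sketch of the generator path: an MLP maps noise concatenated with CLIP text features into a token sequence, a Transformer Encoder applies multi-head attention, and a linear unflatten layer reassembles the tokens into pixel patches. All sizes here are assumptions (512-d CLIP text features as in ViT-B/32, 128-d noise, an 8x8 grid of 8x8 patches), and the upsampling network is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerGenerator(nn.Module):
    """Sketch of a Transformer-based generator: MLP -> Transformer
    Encoder (multi-head attention) -> linear unflatten to patches.
    Architecture details and sizes are illustrative assumptions."""
    def __init__(self, in_dim=640, embed_dim=256, tokens=64, patch=8, heads=8):
        super().__init__()
        self.tokens, self.patch, self.embed_dim = tokens, patch, embed_dim
        self.mlp = nn.Sequential(                    # non-linear feature mapping
            nn.Linear(in_dim, tokens * embed_dim),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.unflatten = nn.Linear(embed_dim, patch * patch * 3)

    def forward(self, noise, text_feat):
        z = torch.cat([noise, text_feat], dim=-1)    # B x in_dim
        x = self.mlp(z).view(-1, self.tokens, self.embed_dim)
        x = self.encoder(x)                          # multi-head self-attention
        patches = self.unflatten(x)                  # B x tokens x (p*p*3)
        side = int(self.tokens ** 0.5)               # 8x8 grid of patches
        img = patches.view(-1, side, side, self.patch, self.patch, 3)
        img = img.permute(0, 5, 1, 3, 2, 4)          # B x 3 x gh x ph x gw x pw
        return img.reshape(-1, 3, side * self.patch, side * self.patch)

noise = torch.randn(4, 128)
text_feat = torch.randn(4, 512)                      # stand-in for CLIP features
img = TransformerGenerator()(noise, text_feat)       # 4 x 3 x 64 x 64
```

The discriminator side would mirror this with a linear flatten layer producing patch tokens for a Transformer Encoder, per the description above.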