As deep learning continues to advance in the realm of cognition, an increasing number of models are being endowed with creative abilities, and text-to-image generation is one such research topic. In real life, people expect text-to-image generation models to possess their own comprehension and reasoning capabilities and ultimately to create vivid and diverse imagery. Early text-to-image generation work was constrained by the state of generative adversarial networks and of computer hardware: only shallow neural networks could be built on the limited computational resources available, and these could not fully fit the training data and generated low-quality images. As generative adversarial networks progressed, more and more researchers began to use multi-stage generative networks with attention mechanisms, in which the multi-stage network progressively raises the resolution of the generated images while the attention mechanism controls their fine-grained details. However, attention-based multi-stage generative networks still face challenges. First, the heavy dependence between the stages of a multi-stage network caps the model's attainable performance. Second, the attention mechanism does not fully account for the granularity relationship between text and images, nor for the importance of individual words. With the development of computer hardware, large-parameter single-stage generative networks have emerged as extraordinarily competitive, but they bring problems of their own. The most serious is that the cross-modal information interaction in single-stage generative networks is weak, which leads to low semantic consistency between the text and the generated images. Strengthening the interaction between text and images has therefore become the focus of single-stage generative networks.

In response to these issues, this dissertation conducts an in-depth exploration of text-to-image generation algorithms based on generative adversarial networks, analyzes the shortcomings of the relevant algorithms, and proposes corresponding improvements. The specific contributions are as follows:

(1) A text-to-image generation model based on adaptive cross-modal attention is proposed. Word-level attention is common practice in multi-stage generative models, but it suffers from a granularity mismatch between words and local image regions. In addition, the discriminators in traditional multi-stage generative models have simple structures and weak feature-extraction ability. For the first problem, the proposed model uses global image features and sentence features to modulate word granularity, and uses a self-attention mechanism to capture the contextual information of words (a minimal sketch of this idea follows below). For the second problem, the proposed model uses a pre-trained image classification model to refine the local and global features extracted by the discriminator, which greatly improves the discriminator's performance.
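The abstract does not specify how this modulation is implemented; the following is a minimal PyTorch sketch of one plausible realization: a gate computed from the sentence embedding and the global image feature rescales each word feature, after which a self-attention layer propagates context between words. The class name, the sigmoid gate, and all dimensions are illustrative assumptions, not the dissertation's actual design.

```python
import torch
import torch.nn as nn

class AdaptiveWordModulation(nn.Module):
    """Illustrative sketch: modulate word granularity with global cues,
    then capture contextual relations between words via self-attention."""
    def __init__(self, word_dim: int, img_dim: int, num_heads: int = 4):
        super().__init__()
        # Gate computed from sentence + global image features (assumption:
        # a sigmoid gate is one simple way to "modulate word granularity").
        self.gate = nn.Linear(word_dim + img_dim, word_dim)
        self.self_attn = nn.MultiheadAttention(word_dim, num_heads, batch_first=True)

    def forward(self, words, sent, img_global):
        # words: (B, L, word_dim); sent: (B, word_dim); img_global: (B, img_dim)
        ctx = torch.cat([sent, img_global], dim=-1)         # (B, word_dim + img_dim)
        gate = torch.sigmoid(self.gate(ctx)).unsqueeze(1)   # (B, 1, word_dim)
        modulated = words * gate                            # granularity-adjusted words
        out, _ = self.self_attn(modulated, modulated, modulated)
        return out                                          # context-aware word features

# Usage with toy tensors: 2 captions, 12 words each.
words = torch.randn(2, 12, 256)
out = AdaptiveWordModulation(256, 512)(words, torch.randn(2, 256), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 12, 256])
```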
(2) A text-to-image generation model based on condition optimization and mutual information maximization is proposed. The vast majority of single-stage generative models ignore the differences between different textual descriptions of the same image. In addition, an unsupervised conditional discriminator may not adequately extract the image features associated with the input text, which prevents the discriminative network from making accurate judgments. For the first problem, the proposed model designs a plug-and-play text-condition construction module. The text conditions it constructs can be regarded as random samples from the neighborhood of the original text condition: they account for the differences between different descriptions of the same image without changing the semantics of the original condition, and they expand the space of text conditions. For the second problem, the proposed model introduces a mutual information loss based on contrastive learning to strengthen the discriminator's feature extraction, which greatly improves the performance of the conditional discriminative network. (Both ideas are sketched after this summary.)

(3) A text-to-image generation model based on adaptive condition enhancement is proposed. When the text knowledge base is very large, the text-condition construction module above incurs a huge retrieval cost, and reusing the same text condition at different stages of the generator lacks diversity. In addition, optimizing the discriminator with only sentence-level and global image features is insufficient. For the first problem, the proposed model introduces an adaptive condition enhancement module, which mines the relationship between words and local image features and then constructs an adaptive semantic condition from the word features to enhance the sentence condition (also sketched below). For the second problem, the proposed model introduces a cross-modal alignment loss, which uses both word-level and sentence-level information to optimize the discriminator's feature extraction. Finally, extensive experiments demonstrate that the proposed models significantly outperform other state-of-the-art works.
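The text-condition construction module of contribution (2) is retrieval-based (as contribution (3) notes), so the code below is not its implementation; it only illustrates the quoted interpretation, "random sampling in the neighborhood of the original text condition," via a Gaussian reparameterization in the style of conditioning augmentation. All names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class NeighborhoodConditionSampler(nn.Module):
    """Illustrative only: draw a text condition from a learned Gaussian
    centred on the original sentence embedding (reparameterization trick),
    i.e. a random sample from the neighborhood of the original condition."""
    def __init__(self, sent_dim: int, cond_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(sent_dim, cond_dim)      # neighborhood centre
        self.to_logvar = nn.Linear(sent_dim, cond_dim)  # neighborhood spread

    def forward(self, sent):
        mu, logvar = self.to_mu(sent), self.to_logvar(sent)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)       # one sampled condition
```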
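The mutual information loss based on contrastive learning in contribution (2) is likewise only named, not specified. An InfoNCE-style objective is the standard contrastive lower bound on mutual information; the sketch below applies it to paired discriminator image features and text conditions, with the unmatched pairs in a batch serving as negatives. The temperature value and the symmetric form are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_mi_loss(img_feats, txt_feats, temperature: float = 0.1):
    """InfoNCE-style sketch: maximize a lower bound on the mutual
    information between discriminator image features and their paired
    text conditions. Matched pairs lie on the diagonal of the logits."""
    img = F.normalize(img_feats, dim=-1)    # (B, D)
    txt = F.normalize(txt_feats, dim=-1)    # (B, D)
    logits = img @ txt.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```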
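For contribution (3), the abstract says the adaptive condition enhancement module mines word/local-image relationships and builds an adaptive semantic condition from the word features. One plausible reading, sketched below, is region-to-word cross-attention whose output is pooled into a vector that augments the sentence condition; the mean pooling, the additive enhancement, and all names are assumptions rather than the dissertation's stated design.

```python
import torch
import torch.nn as nn

class AdaptiveConditionEnhancement(nn.Module):
    """Illustrative sketch: local image regions attend to word features;
    the attended result is pooled into an adaptive semantic condition
    that enhances the sentence-level condition."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regions, words, sent_cond):
        # regions: (B, R, dim) local image features; words: (B, L, dim);
        # sent_cond: (B, dim) sentence-level condition.
        attended, _ = self.cross_attn(regions, words, words)  # (B, R, dim)
        adaptive = attended.mean(dim=1)          # pooled adaptive condition
        return sent_cond + adaptive              # enhanced sentence condition
```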