Text-to-image synthesis is a cross-modal task that aims to generate photo-realistic images from text descriptions. It involves research in both natural language processing and computer vision and has great potential in a variety of real-world applications, making it of significant research value. The mainstream models in this field currently use Generative Adversarial Networks and produce images of high quality. However, two problems remain. First, in the commonly used multi-stage structure, the intermediate-stage networks perform repetitive work and are not used efficiently to refine the details of the generated image. Second, existing text-to-image models are trained on a limited amount of data, so the semantic space they learn is not accurate enough to guarantee the quality of every generated image. To address these problems, this thesis conducts the following research.

For the first problem, this thesis proposes a multi-path text-to-image synthesis structure based on feature fusion, which establishes an efficient feature-fusion mechanism for the multi-stage text-to-image synthesis task to improve the quality of generated images. The proposed multi-path structure has two main components: a staged residual connection and a multi-scale module. The staged residual connection transfers the feature maps of the image generated in the previous stage to the end of the current stage; this path avoids the need for long-term memory and guides the network to focus on modifying and enhancing the details of the generated image. The multi-scale module extracts features at different spatial scales and adaptively integrates these feature maps through a channel attention mechanism, yielding images with richer and finer details. The proposed multi-path method can serve as a common framework for multiple multi-stage models aimed at generating highly detailed images.

For the second problem, this thesis proposes a text-to-image synthesis method based on semantic data augmentation. To compensate for insufficient training data, the approach combines a loss mechanism built on semantic data augmentation with a module that aligns semantic information between text and images. The loss mechanism derives an upper bound on the expected loss under semantic data augmentation that can be computed probabilistically, avoiding the explicit construction of a large augmented sample set. In addition, a text-image alignment module based on contrastive learning is incorporated into the model to further improve the consistency of semantic information between the text and the generated images.

The proposed methods are evaluated extensively on the CUB-200 and COCO datasets. The experimental results demonstrate that they effectively exploit the feature information available in images and text, leading to a significant improvement in the quality of the images generated by the text-to-image model.
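To make the multi-path structure concrete, the following is a minimal PyTorch sketch of its two components: a multi-scale module that fuses features from parallel branches via channel attention, and a stage whose output adds a staged residual connection from the previous stage. All module names, layer sizes, and dilation rates here are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Extract features at several spatial scales and fuse them adaptively
    with squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel 3x3 branches with different dilation rates give
        # receptive fields at different spatial levels.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        fused = channels * len(dilations)
        # Channel attention: global pooling -> bottleneck -> sigmoid gates.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(fused // 4, fused, 1), nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, 1)

    def forward(self, x):
        feats = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        feats = feats * self.attn(feats)  # reweight each scale's channels
        return self.project(feats)

class Stage(nn.Module):
    """One refinement stage with a staged residual connection: the previous
    stage's feature map is added back at the end of the current stage, so
    the stage only needs to learn a residual that refines details."""
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            MultiScaleFusion(channels),
        )

    def forward(self, prev_feat):
        return prev_feat + self.refine(prev_feat)  # staged residual connection
```

Because each stage learns only a residual on top of the previous stage's features, stacking several such stages matches the abstract's goal of reusing earlier work rather than regenerating the image from scratch at every stage.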
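The abstract does not spell out the exact upper-bound loss, but a well-known instance of the "implicit augmentation via a computable upper bound" idea is ISDA (Wang et al., 2019), where each feature is implicitly perturbed along class-conditional semantic directions and the expected cross-entropy is bounded in closed form. The sketch below follows that formulation as a plausible stand-in; the function name and the classification setting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def semantic_aug_upper_bound_loss(features, labels, weight, bias, cov, lam):
    """Closed-form upper bound on the expected cross-entropy when each
    feature a is implicitly augmented as a~ ~ N(a, lam * Sigma_y),
    so no explicit augmented sample set is ever materialized.
    features: (B, D), weight: (C, D), bias: (C,),
    cov: (C, D, D) class-conditional covariance estimates.
    Follows ISDA (Wang et al., 2019); a stand-in for the thesis's bound."""
    logits = features @ weight.t() + bias           # (B, C) plain logits
    w_y = weight[labels]                            # (B, D) target-class weights
    diff = weight.unsqueeze(0) - w_y.unsqueeze(1)   # (B, C, D): w_j - w_y
    sigma_y = cov[labels]                           # (B, D, D)
    # Quadratic form (w_j - w_y)^T Sigma_y (w_j - w_y) for every class j;
    # it is zero for j = y, so the target logit is unchanged.
    quad = torch.einsum('bcd,bde,bce->bc', diff, sigma_y, diff)
    aug_logits = logits + 0.5 * lam * quad          # shift non-target logits up
    return F.cross_entropy(aug_logits, labels)
```

The key property is that the augmented logits are an ordinary batch computation, so the bound costs roughly one extra matrix contraction per batch instead of sampling many augmented features.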
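For the contrastive text-image alignment module, a common recipe is a symmetric InfoNCE loss over a batch of matching pairs, as popularized by CLIP. The thesis's exact form is not specified in the abstract, so the function below, including its name and temperature value, should be read as a hedged sketch of the general technique.

```python
import torch
import torch.nn.functional as F

def text_image_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls matching (image, text) embeddings
    together and pushes mismatched pairs in the batch apart.
    img_emb, txt_emb: (B, D) embeddings of B matching pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Row i should match column i: score both directions of retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this loss aligns the two embedding spaces, which is what enforces the semantic consistency between a caption and its generated image described above.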