Text and images are the most common information carriers in daily life. Text-to-image synthesis is a cross-modal task that extracts feature information from a descriptive text and, on the basis of understanding its semantics, generates high-quality images that are sufficiently realistic, diverse, and consistent with the description. The current mainstream solutions are variants of Generative Adversarial Networks, such as the Attentional Generative Adversarial Network (AttnGAN), which performs well in diversity, clarity, and semantic consistency but still leaves considerable room for improvement in authenticity. To address the insufficient authenticity of generated images, and guided by contrastive learning from the field of self-supervised learning, this paper improves the Deep Attentional Multimodal Similarity Model (DAMSM) and the attentional generative network of AttnGAN as follows.

1) To train better text representations, a contrastive loss is added to DAMSM. DAMSM computes the inter-modal loss between a description text t and an image x, but does not consider the intra-modal correlation between different descriptions of the same image. This paper therefore adds a second text encoder to DAMSM to extract the features of another description t'; by minimizing the contrastive loss of the text pair (t, t'), the feature extraction ability of the text encoder is improved.

2) To make the images generated from different descriptions of the same image more consistent with each other and closer to the real image, the attentional generative network is extended to a Siamese network. In addition to the adversarial loss between generator and discriminator, this paper adds a contrastive loss between the images (x, x') generated from different text descriptions. This contrastive loss minimizes the distance between x and x' in feature space, so that the generator better learns the internal relationship between different generated images and improves the
authenticity of the generated images.

Comparative experiments show that, on the CUB and MS COCO datasets respectively, the improved model reduces FID by 24.91% and 32.04%, increases IS by 0.05% and 0.15%, and increases R-Precision by 0.04% and 0.03%. This indicates that the authenticity of the generated images is greatly enhanced, while image quality, diversity, and semantic consistency are also improved. Finally, this paper completes the design and implementation of a text-to-image system that puts the research results into practice and provides users with personalized image generation services.
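Both improvements hinge on the same mechanism: a contrastive loss that pulls the features of a matched pair, a text pair (t, t') or a generated-image pair (x, x'), together while pushing apart mismatched pairs from the rest of the batch. The abstract does not specify the exact loss formulation, so the following is only a minimal NumPy sketch of a common InfoNCE-style contrastive loss over cosine similarities; the function name and the temperature value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def info_nce_loss(feats_a, feats_b, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss over paired embeddings.

    feats_a[i] and feats_b[i] are features of a matching pair, e.g. two
    descriptions of the same image, or two images generated from different
    descriptions of the same ground-truth image. All other rows of the
    batch serve as negatives for row i.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal; minimizing this loss
    # pulls matched pairs together and pushes mismatched pairs apart.
    return -np.mean(np.diag(log_prob))
```

As a sanity check, the loss is lower when the two feature sets are aligned pair-by-pair than when the pairing is scrambled, which is exactly the behavior used here to tighten text pairs (t, t') and generated-image pairs (x, x').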