
Text-to-Image Synthesis Models In Low-resource Scenarios

Posted on: 2024-05-30   Degree: Doctor   Type: Dissertation
Country: China   Candidate: T T Liu   Full Text: PDF
GTID: 1528307145996289   Subject: Software Engineering
Abstract/Summary:
With substantial improvements in computing power and continued innovation in deep learning models, Artificial Intelligence-Generated Content (AIGC) has been widely adopted in finance, education, medical care, and other industries. Text-to-image synthesis is one of the most prominent tasks in AIGC; it aims to generate images that meet personalized needs from text descriptions at low cost and with high efficiency. In practical applications, however, high-quality training samples are scarce, and current text-to-image synthesis models have large parameter counts and lack the ability to understand the background knowledge behind the text. This raises deployment costs and leaves the quality and diversity of generated images in need of improvement. To improve the performance of small-scale text-to-image synthesis models, this thesis conducts research in three directions: data augmentation, knowledge enhancement, and model complexity reduction. For data augmentation, the number and diversity of samples are increased through transformations of text and images and through image retrieval augmentation, improving the generalization ability of the model. For knowledge enhancement, external knowledge is introduced to enrich information such as entities and attribute dependencies in the text description, providing relevant prior information for the model. For complexity reduction, several sparse attention mechanisms are designed around the interaction characteristics of text and image data, reducing the consumption of computing resources while preserving image generation quality. The main work and contributions of this thesis are as follows:

(1) An entity-based, knowledge-enhanced text-to-image synthesis model is proposed. Existing models treat every word in the text description as equally important and lack background knowledge about entities, which makes it difficult to fully exploit entity semantics when data resources are scarce. This thesis designs and implements an entity knowledge enhancement model that introduces entity representations based on an external knowledge graph as prior information. For Chinese text-to-image generation, where a sentence can be segmented into words in several ways, a word-lattice segmentation method is introduced to identify all possible entities in the text. To avoid injecting too much knowledge into the model and thereby introducing noise, an entity semantic interaction module is designed to filter the entity knowledge representations before they are injected. Experiments on four text-to-image generation datasets verify that the model improves the quality and diversity of generated images and that introducing external entity knowledge is important for low-resource tasks.

(2) A text-to-image synthesis model based on data augmentation and sparse attention is proposed. The entity knowledge-enhanced model above suffers from a large image reconstruction loss in the Vector Quantization (VQ) stage, which degrades generation quality. The second model aims to reduce this reconstruction loss and to handle the resulting increase in model complexity. First, the text and image data are each transformed and augmented to increase the number and diversity of samples and improve the generalization ability of the model. Second, the compression rate of the VQ stage is reduced to improve image reconstruction. Third, to cope with the increase in model complexity caused by the longer image sequence, three sparse attention mechanisms adapted to the interaction characteristics of text-image data are designed, improving image generation quality while reducing the computational complexity of the model. Finally, the optimization objective combines the cross-entropy loss between the generated and real image sequences, the perceptual loss between the generated and real images, and the matching loss between text and image, constraining the model to generate images that are semantically consistent with the text. Experiments on multiple text-to-image generation datasets verify the advantages of the proposed method in image quality and diversity, resource consumption, and inference time.

(3) A text-to-image synthesis model with attribute enhancement and image retrieval enhancement is proposed. The two models above aim to improve the generation of individual entities and of the overall image, but they ignore the relationships between entities and the attribute dependencies of each entity. When generating complex images, they are prone to problems such as misaligned entity attributes and incorrect entity relationships. This thesis therefore proposes a model based on attribute enhancement and image retrieval enhancement. By parsing and learning the dependency relations in the text description, it implicitly injects the attribute information of each entity and the dependency information between entities into the model. In addition, an image retrieval enhancement method combining global semantics with fine-grained scene-graph semantics is designed; it uses the text to retrieve similar images from external image databases, providing image prior information for the model. Experimental results show that the proposed method effectively improves the quality and diversity of generated images for complex scenes.

In summary, this thesis proposes three text-to-image synthesis models to address the challenges of scarce training samples, high model complexity, and missing background knowledge. In the low-resource scenario, where the model does not adequately learn the entities in the text description, an external knowledge graph is introduced as prior information to improve image generation. Because the reconstruction quality of the vector quantization stage sets the upper limit on generation quality, image reconstruction is improved by reducing the image compression rate, and sparse attention mechanisms are designed to reduce the cost of processing the resulting long sequences. In addition, practical applications usually require complex scene images containing multiple entities. Introducing the dependency information in the text and retrieving related images as prior knowledge alleviates the entity attribute misalignment and low diversity of generated images seen in current models. Extensive experiments verify the effectiveness and efficiency of the proposed models.
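
The trade-off in contribution (2), a lower VQ compression rate giving a longer image token sequence that sparse attention must then keep affordable, can be illustrated with a small sketch. This is not one of the three mechanisms from the thesis, which the abstract does not specify. It assumes a single hypothetical pattern in which every position may attend to all text tokens and to a short causal window of recent tokens. The image size (256), downsampling factors (16 and 8), text length (32), and window size (64) are illustrative assumptions, as are the function names.

```python
import numpy as np


def image_token_count(image_size: int, downsample_factor: int) -> int:
    """Number of VQ tokens for a square image encoded on a square grid."""
    side = image_size // downsample_factor
    return side * side


def sparse_text_image_mask(n_text: int, n_image: int, window: int) -> np.ndarray:
    """Boolean attention mask for a [text tokens, image tokens] sequence.

    mask[i, j] is True when position i may attend to position j.
    Hypothetical pattern (an assumption, not the thesis's design):
      * attention is causal (j <= i);
      * every position may attend to any earlier text token;
      * every position may also attend to the last `window` positions,
        itself included.
    """
    n = n_text + n_image
    i = np.arange(n)[:, None]  # query positions, as a column
    j = np.arange(n)[None, :]  # key positions, as a row
    causal = j <= i
    to_text = j < n_text
    local = (i - j) < window
    return causal & (to_text | local)


# Halving the downsampling factor quadruples the image sequence length.
short_len = image_token_count(256, 16)
long_len = image_token_count(256, 8)

# Fraction of full causal attention kept by the sparse pattern.
n_text = 32
mask = sparse_text_image_mask(n_text=n_text, n_image=long_len, window=64)
n = n_text + long_len
density = mask.sum() / (n * (n + 1) // 2)
print(f"image tokens: {short_len} -> {long_len}, kept attention pairs: {density:.1%}")
```

Under this pattern the number of attended pairs grows roughly linearly with the image sequence length instead of quadratically, which is the general reason sparse attention is paired with a lower compression rate.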
Keywords/Search Tags: Text-to-Image Synthesis, Knowledge Enhancement, Data Augmentation, Long Sequence Modeling