
Research On Cross-modal Semantic Relationship Based Image Synthesis

Posted on: 2022-09-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 1488306560989719
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of information technologies and social networks, cross-modal data represented by images and texts, as an important medium for recording information, conveying thoughts and expressing emotions, has had a great impact on people's daily lives. Understanding the semantic content of cross-modal data and accurately analyzing the semantic relationships within it have become important research topics in computer vision. Based on cross-modal data, this thesis addresses two applications: text-to-image synthesis and semantic image synthesis. The main contributions of this thesis are as follows:

1. We propose a saliency-driven text-to-image synthesis algorithm. The text-to-image synthesis task takes text descriptions as inputs and generates photo-realistic images. Existing methods use the entity words in the text as features to construct a mapping between text features and entity objects in the image space. However, in text-to-image synthesis for complex scenes (i.e., where a generated image contains a complex background or multiple foreground objects), existing methods consider each text feature independently, which usually causes large deviations in the contours and positions of the main objects in a generated image. Inspired by the human visual attention mechanism, we propose a saliency-driven text-to-image synthesis algorithm that makes the visual saliency of the generated images consistent with the semantic saliency of the text. In addition, to recover the details missing from images generated by existing methods, we propose a multi-resolution joint deep network that fuses feature distributions at different resolutions during network learning, so that the generated images carry more accurate contour and detail information for the entity objects. Comprehensive experimental results on the single-object CUB and Oxford-102 datasets and the multi-object MS-COCO dataset demonstrate the effectiveness of the proposed method.

2. We propose a novel text-to-image synthesis algorithm based on cost-sensitive learning. Imbalanced sample distributions, especially imbalance between classes, greatly affect the performance of learning algorithms: they usually bias the generator towards classes with many samples, causing a sharp drop in the quality of images generated for classes with few samples. In the text-to-image synthesis task, mainstream datasets (e.g., the CUB and MS-COCO datasets) have high sample complexity, and samples often overlap between different classes; the most typical case is that samples with exactly the same features carry different labels. This situation introduces a large number of repetitive or noisy samples during data sampling. However, existing text-to-image synthesis algorithms do not account for imbalanced sample distributions. We therefore propose a novel text-to-image synthesis algorithm based on cost-sensitive learning. This algorithm converts the data sampling problem in cost-sensitive learning into a random covering problem, which ignores the impact of sample overlap. Random covering refers to covering a large set by a sequence of random small sets: each sample is regarded as a sample point, and each sample point is associated with a small neighboring region instead of a single point. Based on this formulation, the proposed algorithm uses the effective number of samples to learn cost-sensitive factors for the different classes, and further learns a class-specific biased loss, thereby effectively increasing the attention paid to classes with few samples. Experimental results on three public text-to-image synthesis datasets demonstrate that the proposed algorithm is more effective than existing algorithms.

3. We propose an object-driven semantic image synthesis algorithm. The semantic image synthesis task is a subtask of image translation that takes semantic segmentation masks as inputs and generates realistic images. Existing semantic image synthesis methods consider only class label information, so the generative models cannot capture the rich local fine-grained information of the images (e.g., object structure, location, contour and texture) and learn only basic discriminative features (e.g., the global image layout). To solve these problems, we adopt a multi-scale feature fusion algorithm that refines the generated images by learning the fine-grained information of local objects. Specifically, the proposed model first generates multi-scale global image features and local object features, and then fuses the local object features into the global image features to strengthen the correlation between the local and the global. During feature fusion, statistical features (e.g., the standard deviation) are computed from the local object features and then fused with the global image features. The fused features are used to construct correlation filters whose response maps determine the locations, contours and textures of the objects. Experimental results on four public semantic image synthesis datasets demonstrate that the proposed algorithm is clearly superior to existing algorithms in accuracy.
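The abstract does not give the exact formula behind the cost-sensitive factors in contribution 2. A common instantiation of "the effective number of samples" from the class-balanced loss literature is sketched below; the hyperparameter beta and the normalization are assumptions for illustration, not details stated in the thesis.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Cost-sensitive factors from the effective number of samples.

    Effective number E_n = (1 - beta**n) / (1 - beta): as n grows,
    newly sampled points increasingly overlap with already-covered
    regions, so the marginal benefit of each extra sample shrinks.
    Weighting each class by 1 / E_n gives small-sample classes larger
    cost-sensitive factors, counteracting the generator's bias towards
    large-sample classes.
    """
    effective_num = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    weights = [1.0 / e for e in effective_num]
    # Normalize so the weights sum to the number of classes (assumed convention).
    total = sum(weights)
    return [w * len(weights) / total for w in weights]


def biased_loss(per_sample_losses, labels, weights):
    """Class-specific biased loss: scale each sample's loss by its class weight."""
    return sum(weights[y] * l for y, l in zip(labels, per_sample_losses)) / len(labels)
```

For example, with a 1000-sample class and a 10-sample class, the second class receives a much larger weight, so its generation errors contribute more to the training loss.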
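Contribution 3 fuses channel-wise statistics (mean and standard deviation) of local object features into the global image features, but the abstract does not specify the fusion operator. One plausible instantiation is an AdaIN-style statistic transfer, sketched below in plain Python over per-channel activation lists; the operator choice and the representation are assumptions for illustration.

```python
import math


def channel_stats(feat):
    """Per-channel (mean, std) statistics.

    feat: list of channels, each channel a flat list of activations.
    """
    stats = []
    for ch in feat:
        mu = sum(ch) / len(ch)
        var = sum((v - mu) ** 2 for v in ch) / len(ch)
        stats.append((mu, math.sqrt(var)))
    return stats


def stat_fuse(global_feat, local_feat, eps=1e-5):
    """AdaIN-style fusion (assumed operator): normalize each global
    channel, then re-scale and shift it with the local object's
    statistics, so the fused features carry the object's fine-grained
    statistical signature."""
    fused = []
    for g_ch, (l_mu, l_sigma) in zip(global_feat, channel_stats(local_feat)):
        g_mu, g_sigma = channel_stats([g_ch])[0]
        fused.append([l_sigma * (v - g_mu) / (g_sigma + eps) + l_mu for v in g_ch])
    return fused
```

After fusion, each global channel has (up to numerical tolerance) the local object's mean and standard deviation, which is the property the correlation filters would then exploit to localize the object.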
Keywords/Search Tags:Image synthesis, Text-to-image synthesis, Semantic image synthesis, Attention mechanism, Feature fusion, Cost-sensitive learning