
RGB-to-NIR Image Translation Using Generative Adversarial Network

Posted on: 2021-04-14    Degree: Master    Type: Thesis
Country: China    Candidate: Y Huang    Full Text: PDF
GTID: 2428330605964142    Subject: Computer technology
Abstract/Summary:
The modality of an image refers to the description of different attributes of the same object obtained under different acquisition conditions. For example, for a pedestrian in the same posture, the RGB image taken with an ordinary camera under visible light and the NIR image taken with an infrared camera are two different modal images of the same object. Clearly, images of the same object in different modalities contain different features. Tasks built on multi-modal collaboration improve the representational power of the object and provide more comprehensive and accurate information, thereby enhancing and extending downstream applications such as cross-modal pedestrian re-identification.

Generally speaking, cross-modal image generation requires learning a one-to-one mapping between two modalities and places strict requirements on the generated target-modality image: it should match the real target-modality image at the pixel level as closely as possible. However, when the gap between modalities is large, cross-modal translation becomes very difficult. In most cases image-to-image translation is an ill-posed problem: the mapping between images in different domains is not uniquely defined, so for a given source image there may be multiple images that satisfy the target domain. Early image-to-image translation can be traced back to the image analogy algorithm, which learns a pixel-level correspondence from pairs of input-output images and then applies that relationship to new images to obtain the corresponding translated images. In recent years, convolutional neural networks, with their powerful multi-level feature extraction and representation capabilities, have been widely applied to computer vision tasks and have demonstrated excellent performance. With the development of deep generative models, researchers began to apply deep learning to this problem and achieved good translation results; at this stage the translation task can still be regarded as a generalized per-pixel prediction task, similar to instance segmentation. Owing to the excellent performance of generative adversarial networks in producing sharp images, GAN-based models came to be used for image translation, where the generator's input is no longer random noise but the image to be translated. Using large numbers of one-to-one paired images with conditional GANs, researchers proposed a general image-to-image translation architecture, selected suitable loss functions to optimize the model, and achieved good results on many translation tasks.

Depending on the task, different network structures can be combined with different loss functions to train the model. For translation tasks with a fully determined target output, a pixel loss keeps the generated content as consistent with the target as possible. For tasks that focus on semantic translation, such as image style transfer and image super-resolution reconstruction, replacing the pixel loss with a perceptual loss produces higher-quality images. In addition, generative adversarial networks provide an excellent architecture for image-to-image translation, and the adversarial loss from the discriminator can further improve generation quality.
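The abstract gives no implementation details, so the following PyTorch-style sketch is a rough illustration only of how a paired translation objective of this kind might combine a pixel (L1) loss with the discriminator's adversarial loss, in the spirit of conditional GAN models such as pix2pix. The tiny networks G and D and the weight lambda_pix are placeholders, not the thesis's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in networks; any convolutional architecture would do here.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # RGB -> NIR-like output
D = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1))   # judges (source, output) pairs

adv_loss = nn.BCEWithLogitsLoss()   # adversarial term from the discriminator
pix_loss = nn.L1Loss()              # pixel-level consistency term
lambda_pix = 100.0                  # assumed weighting, as in pix2pix-style setups

def generator_step(rgb, nir):
    """One conditional-GAN generator update on a paired (rgb, nir) batch."""
    fake_nir = G(rgb)
    # The discriminator sees the source image concatenated with the candidate output.
    pred_fake = D(torch.cat([rgb, fake_nir], dim=1))
    loss_adv = adv_loss(pred_fake, torch.ones_like(pred_fake))  # try to fool D
    loss_pix = pix_loss(fake_nir, nir)                          # match the real NIR pixels
    return loss_adv + lambda_pix * loss_pix
```

For fully determined targets the pixel term dominates; for semantically oriented tasks it would be replaced or supplemented by a perceptual loss, as noted above.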
Although supervised image-to-image translation algorithms generalize well, they require large amounts of one-to-one paired data for training, and in practical applications such training sets are difficult to obtain. To address this, CycleGAN, a general model that does not require paired training data, was proposed. By adding a cycle-consistency loss, the model preserves certain features of the original image and prevents mode collapse, making unpaired image translation possible. However, unsupervised image-to-image translation is a difficult problem and requires additional assumptions as constraints, for example the shared-latent-space hypothesis, which assumes that when two corresponding images from different domains are mapped into a shared latent space they yield the same latent code; this code can then serve as an intermediate representation for converting between the two images. An unsupervised translation model built on this assumption, combining a variational autoencoder (VAE) with a generative adversarial network (GAN), has achieved good results on many computer vision tasks.

The cross-modal image generation studied in this thesis is a special case of image-to-image translation: it requires learning the one-to-one mapping between the two modalities and imposes strict requirements on the generated target-modality images. Although existing translation algorithms can produce images that look realistic on many tasks, the differences in pixel values and image structure between the generated image and the real image remain large. Cross-modal image generation in this thesis therefore requires the generated image to be as consistent with the real image as possible at the pixel level while preserving as much image structure as possible. Based on this, a solution is proposed: an edge loss function replaces the content loss function in the original model, addressing the blurred edges of the generated images and the weakened generalization ability of the generator caused by the over-constrained content loss. At the same time, local pixel conversion is introduced for pre-training the generator to improve its pixel conversion accuracy. The specific work is described in the following items.

(1) An edge loss function is proposed to replace the content loss function in the original CycleGAN model. First, the edge loss effectively improves the edge accuracy of the generated image: whether an image is in the RGB modality or the NIR modality, the edge detection results are consistent between the two. Second, the content loss constrains the generator's output to stay close to its input, which weakens the generator's generalization ability. Specifically, given a real sample x_k, the content loss drives the generated sample G(x_k) back toward the original real sample, so minimizing a content loss of the form D(x_k, G(x_k))^2 is meaningless when the target lies in a different modality. In contrast, because the edges of an RGB image and its NIR counterpart are consistent, minimizing the edge loss remains meaningful even when the generated edges exactly match the given edges. For the pixel conversion between the two domains, the generator is pre-trained on local pixels as described below.
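The exact form of the edge loss is not specified in the abstract; one plausible sketch, assuming Sobel-filtered edge maps compared with an L1 distance, is given below. The names sobel_edges and edge_loss are illustrative, not the thesis's own functions.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Approximate edge map of a batch of images via Sobel filtering.

    img: tensor of shape (N, C, H, W); edges are computed per channel.
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    c = img.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # small eps keeps the gradient finite

def edge_loss(generated, reference):
    """Penalise differences between the edge maps of generated and reference images."""
    return F.l1_loss(sobel_edges(generated), sobel_edges(reference))
```

Because the RGB image and its NIR counterpart are assumed to share the same edges, this term can be computed against the source image itself, which is exactly the property that keeps it meaningful where the content loss is not; in a CycleGAN-style objective it would take the place of the content (identity) term.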
(2) The symmetric training set is expanded through image partitioning to complete pixel-conversion-based pre-training. Cross-modal image generation requires learning the one-to-one mapping between the two modalities and imposes strict requirements on the generated images, but current image-to-image translation algorithms cannot produce results that meet practical requirements on such problems. This thesis therefore proposes providing a small amount of additional real target-modality observations to help the model learn the one-to-one mapping. First, each high-quality large input image is divided into a large number of small image patches, which greatly expands the dataset available for pre-training. Limited by sensor devices, it is extremely difficult to capture paired images of dynamic objects in different modalities under the same illumination angle, whereas obtaining cross-modal paired images of a static background is much simpler. Since the generator in the pre-training stage focuses only on local pixel conversion and does not care about the global image structure, pre-training on local pixels also makes it possible to use other, more easily obtained cross-modal paired datasets of static objects.

(3) A discriminator loss is added in the pre-training stage of the generator. For instance segmentation, which is also a pixel-level conversion task, the objective to be optimized is a per-pixel distance of the form D(G(I_k), I_k)^2; for image translation this is far from sufficient, because such a loss only cares about the average error over all pixels, not the structure of the generated image. At the same time, under the shared-latent-space assumption, a deep generative model encodes the implicit knowledge of a domain by mapping it into the latent space and can then generate specific samples of the learned domain by controlling the latent variables. Consequently, once the model has learned to generate one set of conditional samples in a domain, it becomes difficult to use it to generate another set of conditional samples, and the model may even require complete retraining. By adding a discriminator loss, the local structure of the images produced by the generator is constrained and the generator's latent coding space is expanded; that is, in the pre-training stage the content-consistency and semantic-consistency requirements of the generator are satisfied simultaneously.
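As a rough sketch only, the following shows what patch-based pre-training of the kind described in (2) and (3) might look like: paired static-scene images are cut into small patches, and the generator is updated with a pixel loss plus a discriminator term. The patch size, the toy networks, and the loss weighting are assumptions for illustration, not the thesis's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def extract_patches(img, patch=32):
    """Cut a batch of images (N, C, H, W) into non-overlapping patch x patch tiles."""
    n, c, h, w = img.shape
    tiles = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (N, C, H/p, W/p, p, p)
    tiles = tiles.reshape(n, c, -1, patch, patch).permute(0, 2, 1, 3, 4)
    return tiles.reshape(-1, c, patch, patch)

# Hypothetical patch-level generator and discriminator.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

def pretrain_step(rgb_big, nir_big):
    """One generator pre-training update on paired static-scene images, at the patch level."""
    rgb_p = extract_patches(rgb_big)
    nir_p = extract_patches(nir_big)
    fake_p = G(rgb_p)
    # Pixel term: local RGB->NIR conversion accuracy (content consistency).
    loss_pix = F.l1_loss(fake_p, nir_p)
    # Discriminator term: constrains the local structure of generated patches
    # (semantic consistency), beyond the average per-pixel error.
    pred_fake = D(fake_p)
    loss_adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return loss_pix + 0.01 * loss_adv   # assumed weighting between the two terms
```

Because the patches carry no global layout, this stage only teaches local pixel conversion; the full CycleGAN-style training with the edge loss described in (1) is still responsible for global image structure.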
Keywords/Search Tags: cross-modal image generation, generative adversarial network, image-to-image translation, convolutional neural network, pedestrian re-identification