
Image Captioning Based On Variational Model

Posted on: 2022-01-22    Degree: Master    Type: Thesis
Country: China    Candidate: L Z Yang    Full Text: PDF
GTID: 2518306485966379    Subject: Computer technology
Abstract/Summary:
Computer Vision (CV) and Natural Language Processing (NLP) are two of the most prominent directions in artificial intelligence in recent years; they have developed independently while also spawning cross-disciplinary sub-topics. One example is the mutual transformation between images and natural language, which involves two opposite processes: from image to natural language and the converse. The process from image to natural language is widely known as Image Captioning. Understanding the visual content of an image and expressing it in natural language is a challenging topic: it has academic value as an important step toward understanding and representing intelligence, and it also plays a key role in the automatic annotation of datasets for many downstream tasks. Image Captioning has therefore become one of the current research hotspots in the cross-media field.

As one of the pioneering tasks in cross-media research, Image Captioning has reached a bottleneck in raising evaluation-metric scores. Research has consequently turned toward diversity, stylization, and controllability, with the goals of generating more "human-like" descriptions and of generating descriptions of a fixed type under given conditions, so as to address the poor performance of generic models in specific scenarios caused by a lack of samples. Based on these two aspects, this thesis explores two types of Image Captioning methods, for general scenes and for specified scenes, analyzes the problems of existing techniques, and proposes new research ideas.

(1) To balance accuracy and diversity in Image Captioning, this thesis studies the technical routes of existing encoder-decoder models and variational models, and proposes a new solution to the inherent tension between diversity and accuracy metrics: increase the variability of each sample while preserving the upper bound of the accuracy score under multiple sampling. Based on the Variational Auto-Encoder (VAE), this thesis uses the random noise sampled in the variational process to increase the diversity of the generated captions. To address the restrictiveness of constraining the prior to a standard Gaussian and the adaptivity issue in the variational process, this thesis also proposes an end-to-end loss function for a Gaussian Mixture Model prior that enables adaptive training and yields a significant improvement over the vanilla VAE model. Experiments show that the proposed model achieves 130.6 on CIDEr (an accuracy metric) and 25.4 on Div-2 (a diversity metric), improvements of 0.3 and 1.6, respectively, over the original Transformer model.

(2) To simplify the complex structure and reduce the parameter count of variational models, this thesis further proposes a Variational Attention Model. By embedding the variational model into multi-head attention, the parameter requirements are greatly reduced while accuracy and diversity are maintained. To compute the attention features in the variational attention model, this thesis proposes a feature-similarity calculation that uses the distance between posterior distributions as the measure, and explores the applicability of various distribution-distance measures to the variational model. Building on this, the thesis proposes an inter-distribution variance estimation algorithm that uses the projected length of sampled features as the measure. Experiments show that this second method achieves 129.6 on CIDEr and 25.34 on All-SPICE, compared with 130.6 and 25.15 for the first method, while reducing the parameter count from 465.5M to 267.8M. The second method thus greatly reduces the number of model parameters and the structural complexity while maintaining the balance between accuracy and diversity.
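The Gaussian-mixture prior in method (1) has no closed-form KL divergence against a Gaussian posterior, so the KL term of the VAE loss must be estimated. As a minimal, hypothetical sketch (not the thesis's actual implementation; all function and parameter names are illustrative), a Monte Carlo estimate using the reparameterization trick could look like:

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log density of a diagonal Gaussian, summed over the last dimension."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2), axis=-1)

def gmm_kl_mc(mu_q, sigma_q, mix_w, mix_mu, mix_sigma,
              n_samples=1000, rng=None):
    """Monte Carlo estimate of KL(q || p), where q is a diagonal Gaussian
    posterior and p is a Gaussian mixture prior (no closed form exists).

    mix_w / mix_mu / mix_sigma: per-component weights, means, std devs.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal((n_samples, mu_q.shape[-1]))
    z = mu_q + sigma_q * eps
    log_q = log_gaussian(z, mu_q, sigma_q)
    # log p(z) = logsumexp_k [ log w_k + log N(z; mu_k, sigma_k) ]
    comp = np.stack([np.log(w) + log_gaussian(z, m, s)
                     for w, m, s in zip(mix_w, mix_mu, mix_sigma)], axis=0)
    log_p = np.logaddexp.reduce(comp, axis=0)
    # KL(q || p) = E_q[ log q(z) - log p(z) ]
    return np.mean(log_q - log_p)
```

Because the samples are drawn via the reparameterization trick, the same computation remains differentiable with respect to the posterior parameters in a deep-learning framework, so it can serve directly as the KL term of an end-to-end training loss.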
Keywords/Search Tags:Image Captioning, Variational Auto-Encoder, Gaussian Mixture Prior, Variational Attention