
Research On Multimodal Machine Translation Method Based On Visual Information

Posted on: 2022-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: P B Liu
Full Text: PDF
GTID: 2518306569496574
Subject: Computer technology
Abstract/Summary:
Multimodal machine translation based on visual information uses image or video information as an auxiliary signal, on top of text-based machine translation, to help the model understand the context and thereby improve translation quality. The usual approach is to fuse the two modalities at the encoder. This thesis studies the two subtasks of visually grounded multimodal machine translation, text-image machine translation and text-video machine translation, and identifies three problems in the field. First, there is no unified multimodal machine translation framework that applies to both subtasks. Second, in text-video machine translation both the video features and the text features are sequential, a property that existing models ignore. Third, images contain content that is irrelevant to the text; this redundant visual information degrades the translation system, so filtering the noise inside the model and selecting the parts that truly provide context is a problem worth studying. To address these problems, this thesis carries out three lines of research:

1. A general multimodal machine translation framework. Multimodal machine translation based on visual information currently lacks a framework that can handle both subtasks at once. The general model proposed in this thesis builds text-aware visual representations and introduces a multimodal gating network that selects and integrates the visual information, allowing it to handle both subtasks uniformly (a minimal sketch of such a gate follows the abstract). On the text-image task, the model reaches the best or near-best BLEU and METEOR scores on the Test2016, Test2017, and MSCOCO test sets; on the text-video task, it improves BLEU by 4 points over the baseline model provided with the VATEX dataset. The other methods in this thesis use this framework as their baseline.

2. A selective attention mechanism. During training, selective attention dynamically selects the regions of the image features that are most closely related to the meaning of the current word, and Gumbel reparameterization makes the selection step differentiable, so the model can update its parameters through backpropagation (see the second sketch after the abstract). As a complement to image denoising, the thesis also introduces a text-image semantic similarity loss that further constrains the representations of the two modalities. Experimental results and example comparisons show that selective attention effectively removes noise from the image features and improves translation quality on the text-image task.

3. Relative distance in multimodal attention. In the baseline model's multimodal attention, a multimodal distance vector is assigned to each multimodal feature pair, and these distance vectors are integrated into both the attention score computation and the final weighted-sum output (see the third sketch after the abstract). Experiments show that the relative-distance method significantly improves translation quality. The method in this thesis ranked fourth on the public VATEX evaluation leaderboard.
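The abstract does not spell out the form of the multimodal gating network, so the following is a minimal sketch of one common gating design, assuming the text encoder and the (text-aware) visual representation share a hidden width; the names MultimodalGate and d_model are hypothetical, not from the thesis.

```python
import torch
import torch.nn as nn

class MultimodalGate(nn.Module):
    """Hypothetical gating unit: each text position decides how much
    of the text-aware visual representation to mix in."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, text_h: torch.Tensor, vis_h: torch.Tensor) -> torch.Tensor:
        # text_h, vis_h: (batch, seq_len, d_model); vis_h is assumed already
        # aligned to text positions (e.g., via cross-attention).
        gate = torch.sigmoid(self.gate_proj(torch.cat([text_h, vis_h], dim=-1)))
        return text_h + gate * vis_h  # gated residual fusion

# Usage with random tensors standing in for real encoder outputs.
fuse = MultimodalGate(d_model=512)
fused = fuse(torch.randn(2, 10, 512), torch.randn(2, 10, 512))  # (2, 10, 512)
```

A sigmoid gate of this kind lets the model suppress visual input per position, which matches the abstract's claim that the network "selects and integrates" the visual information.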
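For the selective attention mechanism, the abstract names Gumbel reparameterization as the trick that keeps a hard region selection differentiable. The sketch below uses PyTorch's straight-through Gumbel-Softmax for that step, and adds one plausible form of the text-image semantic similarity loss (a cosine term); SelectiveAttention, similarity_loss, and the pooling convention are assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    """Hypothetical sketch: pick one image region per target word with a
    straight-through Gumbel-Softmax, so selection stays differentiable."""
    def __init__(self, d_model: int, tau: float = 1.0):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.tau = tau

    def forward(self, word_h: torch.Tensor, region_h: torch.Tensor) -> torch.Tensor:
        # word_h: (batch, tgt_len, d), region_h: (batch, n_regions, d)
        logits = self.q(word_h) @ self.k(region_h).transpose(1, 2)
        # hard=True: one-hot region choice forward, soft gradients backward.
        sel = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        return sel @ region_h  # (batch, tgt_len, d): selected region per word

def similarity_loss(text_vec: torch.Tensor, img_vec: torch.Tensor) -> torch.Tensor:
    # One plausible text-image semantic similarity loss: pull the pooled
    # sentence and image representations toward each other.
    return 1.0 - F.cosine_similarity(text_vec, img_vec, dim=-1).mean()
```

The hard one-hot choice is what discards irrelevant regions at inference time; the Gumbel relaxation only exists so that this discrete choice can still pass gradients during training.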
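The description of the relative-distance method, with distance vectors entering both the attention scores and the weighted sum, resembles the relative-position scheme of Shaw et al. (2018). The sketch below adapts that scheme under the assumption that "distance" means the clipped signed offset between a text position and a video-frame position; RelativeDistanceAttention and max_dist are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeDistanceAttention(nn.Module):
    """Hypothetical single-head text-to-video attention with relative-distance
    embeddings added to the scores (rel_k) and the weighted sum (rel_v)."""
    def __init__(self, d_model: int, max_dist: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.max_dist = max_dist
        self.rel_k = nn.Embedding(2 * max_dist + 1, d_model)
        self.rel_v = nn.Embedding(2 * max_dist + 1, d_model)

    def forward(self, text_h: torch.Tensor, video_h: torch.Tensor) -> torch.Tensor:
        # text_h: (b, t, d) queries; video_h: (b, s, d) keys/values.
        t, s, d = text_h.size(1), video_h.size(1), text_h.size(2)
        q, k, v = self.q(text_h), self.k(video_h), self.v(video_h)
        # Clipped signed distance between each text/video position pair.
        dist = torch.arange(s)[None, :] - torch.arange(t)[:, None]   # (t, s)
        dist = dist.clamp(-self.max_dist, self.max_dist) + self.max_dist
        rk, rv = self.rel_k(dist), self.rel_v(dist)                  # (t, s, d)
        scores = (q @ k.transpose(1, 2)
                  + torch.einsum('btd,tsd->bts', q, rk)) / d ** 0.5
        w = F.softmax(scores, dim=-1)                                # (b, t, s)
        # Distance vectors also enter the final weighted-sum output.
        return w @ v + torch.einsum('bts,tsd->btd', w, rv)
```

Injecting distances this way gives the attention a notion of where a video frame sits relative to the word being translated, which is one way to exploit the sequential structure that the abstract says existing text-video models ignore.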
Keywords/Search Tags:Neural Machine Translation, Multimodal Learning, Deep Learning, Attention