
Research On Multimodal Machine Translation Based On The Fusion Of Visual Features And Semantic Information

Posted on: 2024-07-25    Degree: Master    Type: Thesis
Country: China    Candidate: Z Q Yu    Full Text: PDF
GTID: 2568307076973569    Subject: Electronic information
Abstract/Summary:
Multimodal machine translation fuses information from other modalities with textual information on top of conventional text-only translation in order to improve translation quality. An analysis of current research on multimodal machine translation reveals several open issues. First, most existing work focuses on languages with abundant resources, while low-resource languages receive far less attention. Second, a large body of work seeks to fuse visual and textual information, but in doing so the semantic consistency between bilingual text and visual information during encoding and decoding is often overlooked. Third, current research relies on human-annotated datasets; although human annotation ensures relative accuracy, its high cost also limits the development of multimodal machine translation research to some extent.

To address these issues, this thesis carries out the following work:

1. It proposes an improved multimodal machine translation framework for low-resource languages. Different models are used to extract image features, and the visual features are fused with textual information during translation to assist the translation model; experiments are conducted on low-resource language data (a sketch of the feature-extraction step appears below). We also analyze how the image features extracted by different models affect translation results. Experiments show that the method achieves better translation quality on low-resource languages.

2. It proposes a multimodal machine translation framework that integrates visual attention at both the encoder and the decoder to fuse visual information. Visual information is injected on both sides so that the model learns the interaction between visual and textual features; because the visual signal provides global context, the encoder and decoder can learn bilingual representations. In addition, a new bilingual visually consistent decoder is introduced to better represent corresponding image-sentence pairs (a sketch of such a fusion layer appears below). Experiments show that the proposed method effectively exploits visual information to improve translation performance.

3. It proposes a text-image retrieval method based on text-image feature encoding to reduce the reliance on manually annotated data. Both text and images are encoded and mapped into a shared vector space, and cosine similarity is used to find the image most similar to the text; the features of the matched image are then extracted to assist translation (a sketch of this retrieval step appears below). Experiments show that the method improves over text-only machine translation, alleviates the difficulty of data annotation, and validates the effectiveness of the proposed approach.
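The abstract does not name the image feature extractors used in the first contribution. The following minimal sketch, which uses a pretrained ResNet-50 from torchvision purely as an illustrative choice, shows one common way to obtain a global image feature vector that can then be fused with the text representation.

```python
# Minimal sketch: global image feature extraction with a pretrained CNN.
# ResNet-50 is an illustrative assumption, not the extractor used in the thesis.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained backbone and drop its classification head,
# keeping the pooled feature vector (2048-d for ResNet-50).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_feature(path: str) -> torch.Tensor:
    """Return a single global feature vector for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(img).squeeze(0)  # shape: (2048,)
```

Swapping in a different backbone (e.g. another torchvision model with its head removed) is what allows comparing how different extractors affect translation quality.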
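For the second contribution, the abstract describes injecting visual information into both the encoder and the decoder so that text and visual features interact. The sketch below shows one plausible realization of such a fusion layer: a gated cross-attention block in which text hidden states attend over projected image features. The dimensions and the gating scheme are assumptions, not the thesis's actual architecture.

```python
# Minimal sketch: gated cross-attention fusion of visual features into text states.
# d_model, d_vis, and the sigmoid gate are illustrative assumptions.
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, d_vis: int = 2048, n_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, d_model)   # map image features into model space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)       # learn how much visual context to inject
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, seq_len, d_model); vis_feats: (batch, n_regions, d_vis)
        vis = self.vis_proj(vis_feats)
        ctx, _ = self.attn(query=text_states, key=vis, value=vis)
        g = torch.sigmoid(self.gate(torch.cat([text_states, ctx], dim=-1)))
        return self.norm(text_states + g * ctx)     # gated residual fusion
```

A layer like this can be placed inside both the encoder and the decoder stacks, which is the spirit of the encoder-decoder fusion the abstract describes.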
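The retrieval step of the third contribution is concrete enough to sketch: given a text embedding and a gallery of image embeddings in a shared vector space, the image with the highest cosine similarity is selected. The encoders that produce these embeddings are assumed to exist and are not specified by the abstract.

```python
# Minimal sketch of cosine-similarity text-image retrieval over precomputed embeddings.
import torch
import torch.nn.functional as F

def retrieve_best_image(text_emb: torch.Tensor, image_embs: torch.Tensor) -> int:
    """Return the index of the image whose embedding is closest to the text.

    text_emb:   (d,)   embedding of the source sentence
    image_embs: (n, d) embeddings of the candidate images
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ text_emb          # cosine similarity after normalization
    return int(torch.argmax(sims))
```

The features of the retrieved image would then be passed to the translation model in place of a human-annotated image, which is how the method sidesteps manual annotation.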
Keywords/Search Tags: Multi-modal, Visual Information, Information Fusion, Low-resource Language, Text-image Matching, Multi-modal Machine Translation