Robustness On Neural Machine Translation

Posted on: 2023-12-10  Degree: Master  Type: Thesis
Country: China  Candidate: X Li  Full Text: PDF
GTID: 2558306845499344  Subject: Computer science and technology
Abstract/Summary:
Machine translation is a technique that uses computational linguistics to map a source language into a target language. It is one of the core directions of natural language processing research, with profound research significance and a wide range of application scenarios, and it is among the natural language technologies most successfully deployed in industry. It has therefore received extensive attention from both industry and academia. With the continuous innovation of neural networks, the growing availability of data and the gradually decreasing cost of computational resources, neural machine translation has become the mainstream machine translation method owing to its excellent translation performance.

However, the strong performance of neural machine translation depends on high-quality training data, and the model can produce unexpected translations when the test input is not a standard, high-quality utterance. This is the robustness problem of neural machine translation. Researchers have proposed a number of approaches to address it, the most widely used of which is the data-based back-translation technique: a baseline translation model is trained on the original data and is then used to translate the target side of an existing parallel corpus, as well as large amounts of target-language monolingual text available on the Internet, producing a pseudo-parallel corpus that expands the original parallel corpus. Increasing the diversity of the data in this way gives the model a stronger translation capability. In practice, however, differences in application scenarios introduce a diversity of input noise that cannot be exhaustively enumerated and turned into a pseudo-parallel corpus for training.

To address the robustness problem caused by such unenumerable noisy data, this thesis takes two kinds of noise that interfere with the model as its starting point: errors introduced by Automatic Speech Recognition (ASR), and out-of-vocabulary (OOV) words. It focuses on solving the noise-induced robustness problem by refining the model and compensating the data, so that both noisy and non-noisy inputs yield correct translations. The innovations and contributions of this thesis are as follows.

(1) This thesis introduces a noise-robust machine translation algorithm based on auto-encoding. It is an end-to-end neural machine translation architecture that further improves the model's ability to translate noisy data in ASR error scenarios by adding a noisy-word detection module and a noisy-word recovery module to the encoder. The study simulates ASR recognition errors and injects them into the NIST Chinese-English dataset to evaluate the model, verifies the performance on another language pair using the WMT English-German data, and finally translates text produced by ASR in real-life scenarios.

(2) This thesis introduces an algorithm for modeling multi-granularity tokenization of out-of-vocabulary complex words. It reconstructs a new input word vector by re-segmenting complex words at the data level and dynamically fusing their tokens through self-attention at the model level. Robustness is addressed by deliberately introducing randomness during tokenization, so that the model can cope with an open-ended set of out-of-vocabulary words. The experimental results demonstrate strong performance on large, medium and small datasets as well as in cross-domain scenarios.
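The idea of introducing randomness during tokenization can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `random_segment` and the parameter `split_prob` are hypothetical, and the sketch only shows how one complex word can be split at several granularities by cutting at each character boundary with a fixed probability. The self-attention fusion step described in the abstract is not modeled here.

```python
import random


def random_segment(word, split_prob=0.3, rng=None):
    """Split `word` into contiguous pieces (hypothetical helper).

    Each internal character boundary becomes a cut point with
    probability `split_prob`, so repeated calls can yield coarser
    or finer segmentations of the same word.
    """
    rng = rng or random.Random()
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < split_prob:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])  # remaining tail (the whole word if no cut was made)
    return pieces


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(random_segment("untranslatable", 0.3, rng))
```

With `split_prob=0.0` the word is kept whole, and with `split_prob=1.0` it is split into single characters; values in between sample intermediate granularities, and concatenating the pieces always recovers the original word.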
Keywords/Search Tags:Neural Machine Translation, Robustness, Automatic Speech Recognition, Out-Of-Vocabulary Words