
Research On Mongolian-Chinese Neural Machine Translation Based On Data Augmentation And Pseudo-Parallel Corpus

Posted on: 2022-03-21    Degree: Master    Type: Thesis
Country: China    Candidate: Y C Liu    Full Text: PDF
GTID: 2518306542476554    Subject: Master of Engineering
Abstract/Summary:
Neural machine translation has developed rapidly and achieved many research results. In the Mongolian-Chinese translation task, however, its performance remains unsatisfactory: neural machine translation relies on large-scale parallel data, which is exactly what low-resource language pairs such as Mongolian-Chinese lack. This thesis therefore focuses on building pseudo-parallel corpora with data augmentation methods to compensate for the scarcity of parallel data in Mongolian-Chinese machine translation.

First, a pseudo-parallel corpus is built with easy data augmentation and back translation. Easy data augmentation transforms Chinese sentences in four different ways; each newly constructed Chinese sentence is then labelled with the Mongolian sentence that corresponds to the original, real Chinese sentence, yielding Mongolian-Chinese pseudo-parallel pairs. Back translation uses a Mongolian-to-Chinese translation model to construct pseudo-parallel pairs. When filtering the Mongolian monolingual corpus, the most suitable sentences are selected according to the Mongolian words that cannot be predicted when translating from Chinese to Mongolian and according to the contextual environment of each sentence. The back-translation model is built on a conditional generative adversarial network: the generator performs the back translation, while the discriminator pushes the generator to produce sentences closer to real Chinese.

Second, the byte pair encoding algorithm is applied during data pre-processing. By incorporating a vocabulary of multiple granularities, it enables the Mongolian-Chinese translation model to handle rare words to some extent and improves its robustness. A pre-trained ELMo (Embeddings from Language Models) model is adopted to capture the word-meaning, syntactic, and semantic information of Chinese words; the three kinds of information are merged through a linear function to obtain word embedding vectors dynamically, producing more accurate representations and improving the quality of the translation model.

Finally, a Mongolian-Chinese translation model is constructed on the Transformer architecture and combined with soft contextual data augmentation: some Chinese words are selected at random before they enter the embedding layer of the encoder, their one-hot vectors are smoothed into distributions that carry richer linguistic information, and the resulting "soft word vectors" are fed into the embedding layer of the encoder. Experimental results show that the combination of the above methods improves the performance of the Transformer model.
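As a concrete illustration of the easy data augmentation step, the sketch below implements the four standard EDA operations (synonym replacement, random insertion, random swap, random deletion) on a tokenised Chinese sentence. It is a minimal Python sketch rather than the thesis code; in particular, get_synonyms stands in for whatever Chinese synonym resource was actually used, which the abstract does not name.

import random

def random_swap(tokens, n=1):
    # Swap two randomly chosen positions n times.
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    # Drop each token with probability p, but never return an empty sentence.
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def synonym_replacement(tokens, get_synonyms, n=1):
    # Replace up to n tokens with one of their synonyms.
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if get_synonyms(t)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = random.choice(get_synonyms(tokens[i]))
    return tokens

def random_insertion(tokens, get_synonyms, n=1):
    # Insert a synonym of a random token at a random position.
    tokens = tokens[:]
    for _ in range(n):
        syns = get_synonyms(random.choice(tokens))
        if syns:
            tokens.insert(random.randrange(len(tokens) + 1), random.choice(syns))
    return tokens

Each augmented Chinese sentence would then be paired with the Mongolian sentence of the original pair, as described above.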
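The byte pair encoding pre-processing can be reproduced with any standard subword tool. The snippet below uses the sentencepiece library purely as an example, since the abstract does not say which implementation was used; the file name and vocabulary size are placeholders.

import sentencepiece as spm

# Learn a BPE model on the Chinese side of the training corpus
# (file name and vocabulary size are placeholders, not from the thesis).
spm.SentencePieceTrainer.train(
    input="train.zh.txt", model_prefix="zh_bpe",
    vocab_size=16000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
# A rare word is split into more frequent subword units, so the
# translation model works with vocabulary of multiple granularities.
print(sp.encode("这是一个例句", out_type=str))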
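The abstract describes merging the three kinds of ELMo information through a linear function to obtain dynamic word embeddings. One plausible reading, sketched below in PyTorch, is the standard ELMo scalar mixture: a softmax-weighted sum of the layer outputs with a global scale. The exact linear function used in the thesis may differ.

import torch
import torch.nn as nn

class ElmoLayerMix(nn.Module):
    # Softmax-weighted linear combination of the ELMo layer outputs,
    # producing a dynamic word embedding for each Chinese token.
    def __init__(self, num_layers=3):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # mixing weights
        self.gamma = nn.Parameter(torch.ones(1))              # global scale

    def forward(self, layer_reps):
        # layer_reps: list of num_layers tensors, each [batch, seq_len, dim],
        # e.g. the token layer and the two BiLSTM layers of ELMo.
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * rep for w, rep in zip(weights, layer_reps))
        return self.gamma * mixed  # fed to the NMT encoder as word embeddings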
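Finally, soft contextual data augmentation replaces the one-hot vectors of randomly chosen Chinese input words with smoothed distributions over the vocabulary before the encoder embedding layer. A minimal PyTorch sketch is given below; the source of the smoothing distribution (here lm_probs, e.g. from a language model) and the replacement probability are assumptions, as the abstract does not specify them.

import torch

def soft_word_vectors(token_ids, embedding_matrix, lm_probs, replace_prob=0.15):
    # token_ids:        [batch, seq_len] Chinese token indices
    # embedding_matrix: [vocab, dim] encoder embedding weights
    # lm_probs:         [batch, seq_len, vocab] smoothing distribution,
    #                   e.g. from a language model (an assumption)
    batch, seq_len = token_ids.shape
    vocab = embedding_matrix.size(0)
    # Hard one-hot vectors for every input position.
    one_hot = torch.zeros(batch, seq_len, vocab).scatter_(
        2, token_ids.unsqueeze(-1), 1.0)
    # Randomly choose the positions whose one-hot vectors are smoothed.
    mask = (torch.rand(batch, seq_len, 1) < replace_prob).float()
    smoothed = mask * lm_probs + (1.0 - mask) * one_hot
    # Soft word vectors: the expected embedding under the smoothed distribution.
    return smoothed @ embedding_matrix  # [batch, seq_len, dim]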
Keywords/Search Tags: Mongolian-Chinese Neural Machine Translation, Data Augmentation, Byte Pair Encoding, Pre-Training