
Research On Chinese-Mongolian Neural Machine Translation Based On Monolingual Corpora

Posted on: 2021-05-21  Degree: Master  Type: Thesis
Country: China  Candidate: Y C Cao  Full Text: PDF
GTID: 2428330602996199  Subject: Computer application technology
Abstract/Summary:
Machine translation is an important research direction in natural language processing, and with the rapid development of deep learning, neural machine translation (NMT) has become the mainstream approach in both machine translation research and application. However, NMT still relies heavily on large-scale parallel corpora to achieve good translation quality, so it performs poorly on low-resource language pairs such as Chinese-Mongolian. Compared with parallel corpora, monolingual corpora are far more abundant and easier to obtain, and they play an important role in low-resource machine translation; nevertheless, monolingual corpora have not yet been fully exploited in NMT. Given the shortage of Chinese-Mongolian parallel corpus resources and the complex word-formation rules of Mongolian, this thesis explores the use of monolingual corpora as a supplement to parallel corpora in Chinese-Mongolian NMT and proposes several Chinese-Mongolian NMT methods based on monolingual corpora. The main contributions are as follows:

(1) This dissertation proposes a Chinese-Mongolian NMT method combining word embedding alignment and language modeling. First, Chinese and Mongolian word embeddings are trained on the respective monolingual corpora, and the embedding layers of the translation model are initialized with the aligned Chinese-Mongolian word embeddings. At the same time, the monolingual corpora are used to train language models during translation training, strengthening the encoding and decoding capacity of the model.

(2) This dissertation proposes a Chinese-Mongolian NMT method based on character-level language modeling. Because NMT systems struggle with unknown or low-frequency words, this method splits Chinese and Mongolian words into characters so that the model can handle words that do not appear in the training corpora. In addition, thanks to the dual structure of the model, character-level language modeling can be performed during the translation process, which makes the output more fluent.

(3) This dissertation proposes a Chinese-Mongolian NMT method combining weight sharing and character-aware language-model pre-training. To better exploit the commonality between the two languages, the parameters of the first few encoder layers are shared. In addition, the whole model is pre-trained on monolingual corpora with a character-aware language-modeling objective, and the translation model is initialized from the pre-trained model before translation training begins. Finally, during the first half of translation training, character-level language modeling is added as an auxiliary objective to fine-tune the full model and further improve translation performance.

In summary, this dissertation explores the application of monolingual corpora to Chinese-Mongolian NMT and proposes three methods: word embedding alignment combined with language modeling, character-level language modeling, and weight sharing combined with character-aware language-model pre-training. Experimental results demonstrate that all three monolingual-corpus-based models significantly improve the effectiveness of Chinese-Mongolian NMT.
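The word embedding alignment in method (1) is commonly implemented as an orthogonal (Procrustes) mapping between the two monolingual embedding spaces. The abstract does not specify the alignment algorithm, so the following is a minimal sketch under that assumption, using synthetic embeddings in place of real Chinese/Mongolian vectors; `procrustes_align` is an illustrative name, not from the thesis.

```python
import numpy as np

def procrustes_align(src_emb, tgt_emb):
    """Return an orthogonal matrix W minimizing ||src_emb @ W - tgt_emb||_F.

    Given paired rows (e.g. from a seed bilingual dictionary), the closed-form
    Procrustes solution is W = U @ Vt, where src_emb.T @ tgt_emb = U S Vt.
    """
    u, _, vt = np.linalg.svd(src_emb.T @ tgt_emb)
    return u @ vt

rng = np.random.default_rng(0)
d = 4
# Synthetic "source" embeddings and a hidden orthogonal map producing the "target" space.
src = rng.normal(size=(10, d))
q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal ground-truth mapping
tgt = src @ q

W = procrustes_align(src, tgt)
err = np.linalg.norm(src @ W - tgt)  # near zero: the mapping is recovered
```

In practice the paired rows would come from a small Chinese-Mongolian seed dictionary (or an unsupervised initialization), and the mapped embeddings would then initialize the translation model's embedding layers as described above.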
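The character-level segmentation in method (2) can be sketched as a reversible word-to-character split. The thesis does not give its exact scheme; the sketch below uses the common convention (as in SentencePiece) of marking word-initial characters with "▁" so the original sentence can be recovered. Function names are illustrative.

```python
def to_char_level(sentence, boundary="\u2581"):
    """Split each whitespace-separated word into characters,
    prefixing the first character of every word with a boundary mark."""
    tokens = []
    for word in sentence.split():
        chars = list(word)
        chars[0] = boundary + chars[0]
        tokens.extend(chars)
    return tokens

def from_char_level(tokens, boundary="\u2581"):
    """Invert to_char_level: boundary marks become word separators."""
    out = []
    for t in tokens:
        if t.startswith(boundary):
            out.append(" " + t[len(boundary):])
        else:
            out.append(t)
    return "".join(out).strip()
```

Because every sentence reduces to a small, closed character vocabulary, no word is ever out-of-vocabulary, which is the property the method relies on for unknown and low-frequency words.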
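The weight sharing in method (3) amounts to having both language encoders reference the same parameters for their first few layers, while keeping upper layers language-specific. The thesis does not describe its implementation; the toy numpy sketch below only illustrates the parameter-sharing structure (in a real framework such as PyTorch this would be the same module object reused in both encoders).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Lower layer shared by both language encoders: both paths reference the
# SAME array, so training through either language updates it.
shared_w = rng.normal(size=(d, d))

# Language-specific upper layers (illustrative names).
zh_top = rng.normal(size=(d, d))
mn_top = rng.normal(size=(d, d))

def encode(x, top_w):
    h = np.tanh(x @ shared_w)  # shared first layer captures cross-lingual commonality
    return np.tanh(h @ top_w)  # language-specific layer on top

x = rng.normal(size=(1, d))
zh_out = encode(x, zh_top)
mn_out = encode(x, mn_top)
```

Only the split between shared and private parameters is shown here; the number of shared layers is a hyperparameter ("the first few layers" in the abstract).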
Keywords/Search Tags:Chinese-Mongolian neural machine translation, Monolingual corpora, Word embedding alignment, Language modeling, Weight sharing