
Research On Mongolian And Chinese Machine Translation Based On Monolingual Corpus Training

Posted on: 2020-02-25    Degree: Master    Type: Thesis
Country: China    Candidate: X H Niu    Full Text: PDF
GTID: 2428330590459723    Subject: Computer application technology
Abstract/Summary:
Machine translation is one of the most important research topics in the field of artificial intelligence. Its main goal is to achieve automatic conversion from one natural language to another. With the continuous development of the Internet, machine translation has attracted growing attention, and as research methods have advanced, translation performance has steadily improved over the past decades, from the early rule-based techniques to the now popular neural network-based translation models. Although recently proposed Neural Machine Translation (NMT) methods achieve state-of-the-art results on languages with large-scale, high-quality training corpora, they remain far from perfect on low-resource languages. Because NMT approaches are data-driven in nature, their performance is heavily influenced by the quality of the training data; however, owing to the relatively slow economic development in minority areas, it is often difficult to collect a high-quality training corpus. To address this challenge, this thesis proposes a method based on monolingual corpus training to alleviate the scarcity of Mongolian-Chinese parallel corpora and thereby improve the performance of Mongolian-Chinese machine translation systems.

In view of the scarcity of parallel corpus resources for Mongolian-Chinese machine translation, this thesis organizes monolingual corpus training into three stages: training language models, initializing translation models, and optimizing the initial translation models through iterative back-translation. The first of these stages is studied in depth: pre-training Mongolian and Chinese language models based on the multi-head self-attention mechanism significantly improves the performance of a Mongolian-Chinese translation model trained on monolingual corpora.

Another contribution of this thesis is combining corpora of different granularities. The Mongolian vocabulary is huge, and new words can be formed by attaching additional components to word stems, so no translation model can cover all words: the Out-Of-Vocabulary (OOV) problem is unavoidable in a word-level translation model. The set of Mongolian characters, by contrast, is small and closed; every word is a character sequence, and these sequences follow combination rules that are well suited to neural network learning. This thesis therefore jointly considers features of different granularities, and the experimental results show that this approach effectively alleviates the OOV problem in Mongolian-Chinese machine translation.

Finally, the thesis compares the Mongolian-Chinese translation model trained on monolingual corpora with a Mongolian-Chinese translation model based on an LSTM neural network, evaluated with BLEU, the standard metric for machine translation. The experiments show that pre-training a cross-lingual language model on Mongolian and Chinese monolingual corpora greatly improves the monolingually trained translation model, whose BLEU score comes close to that of a Mongolian-Chinese model trained on a bilingual corpus of 100,000 sentence pairs. Since monolingual corpora are far easier to obtain than bilingual ones, the method based on monolingual corpus training has distinct advantages for the Mongolian-Chinese machine translation task.
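The three-stage pipeline described above can be illustrated with a minimal control-flow sketch. The function names (train_lm, init_translator, translate, fine_tune) are hypothetical stand-ins for real model code, not the thesis's implementation; only the structure of iterative back-translation is shown.

```python
# Sketch of the three training stages: language-model pre-training,
# translation-model initialization, and iterative back-translation.
# All model-level callables are hypothetical placeholders.

def train_by_back_translation(mn_corpus, zh_corpus,
                              train_lm, init_translator,
                              translate, fine_tune, rounds=3):
    # Stage 1: pre-train a language model on each monolingual corpus.
    lm_mn, lm_zh = train_lm(mn_corpus), train_lm(zh_corpus)

    # Stage 2: initialize translation models in both directions.
    mn2zh = init_translator(src_lm=lm_mn, tgt_lm=lm_zh)
    zh2mn = init_translator(src_lm=lm_zh, tgt_lm=lm_mn)

    # Stage 3: iterative back-translation. Each model translates real
    # monolingual sentences into synthetic sources, and the resulting
    # (synthetic source, authentic target) pairs fine-tune the model
    # for the opposite direction.
    for _ in range(rounds):
        synth_for_zh2mn = [(translate(mn2zh, s), s) for s in mn_corpus]
        synth_for_mn2zh = [(translate(zh2mn, s), s) for s in zh_corpus]
        zh2mn = fine_tune(zh2mn, synth_for_zh2mn)  # (zh source, mn target)
        mn2zh = fine_tune(mn2zh, synth_for_mn2zh)  # (mn source, zh target)
    return mn2zh, zh2mn
```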
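The multi-head self-attention mechanism used in the language-model pre-training stage follows the standard scaled dot-product formulation (Vaswani et al., 2017). The following NumPy sketch is illustrative only, not the thesis's implementation.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Minimal multi-head self-attention over X of shape (seq_len, d_model);
    all weight matrices have shape (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project inputs to queries/keys/values and split into heads.
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)

    # Scaled dot-product attention, computed per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax over keys
    heads = weights @ V                                    # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: a 4-token sequence, d_model=8, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
print(multi_head_self_attention(X, *W, n_heads=2).shape)  # (4, 8)
```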
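The mixed-granularity idea can be sketched as a tokenizer in which in-vocabulary words stay as word-level units while rare words fall back to character sequences, so no input is ever truly out of vocabulary. The vocabulary and the "@@" continuation marker (borrowed from BPE conventions) are illustrative assumptions, not the thesis's exact scheme.

```python
# Sketch of mixed word/character granularity: known words are kept
# whole; OOV words are decomposed into characters, which form a small
# closed set and therefore always stay in the vocabulary.

def mixed_granularity_tokenize(sentence, word_vocab):
    tokens = []
    for word in sentence.split():
        if word in word_vocab:
            tokens.append(word)
        else:
            # Fall back to characters; "@@" marks a continued unit.
            chars = list(word)
            tokens.extend(c + "@@" for c in chars[:-1])
            tokens.append(chars[-1])
    return tokens

# Example with a toy vocabulary:
vocab = {"this", "is", "a"}
print(mixed_granularity_tokenize("this is a rare word", vocab))
# ['this', 'is', 'a', 'r@@', 'a@@', 'r@@', 'e', 'w@@', 'o@@', 'r@@', 'd']
```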
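The BLEU comparison between the monolingually trained model and the LSTM baseline could be run with an off-the-shelf scorer such as the sacrebleu package; the sentences below are toy placeholders, not the thesis's test data.

```python
# Hedged sketch of a corpus-level BLEU comparison using sacrebleu
# (pip install sacrebleu). All sentences here are invented examples.
import sacrebleu

references = ["the cat sat on the mat", "he went to the market"]
mono_hyps = ["the cat sat on a mat", "he went to market"]  # monolingual-trained system
lstm_hyps = ["cat sat the mat on", "he go market"]         # LSTM baseline

# corpus_bleu takes a list of hypotheses and a list of reference streams.
print("mono-trained BLEU:", sacrebleu.corpus_bleu(mono_hyps, [references]).score)
print("LSTM baseline BLEU:", sacrebleu.corpus_bleu(lstm_hyps, [references]).score)
```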
Keywords/Search Tags:Mongolian and Chinese Machine Translation, Monolingual Corpus, Pre-trained, Cross-lingual Word Embedding, Different Translation Units