Research On Chinese-Myanmar Neural Machine Translation Method With Monolingual Corpus

Posted on:2021-10-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

Full Text:PDF

GTID:2518306200453254

Subject:Control Engineering

Abstract/Summary:

Neural machine translation has achieved good results on many language pairs. However, it depends heavily on the size of the parallel corpus, and Burmese is a low-resource language. No public Chinese-Burmese parallel data set is available on the Internet, so Chinese-Burmese parallel corpora are extremely scarce. Monolingual corpora, by contrast, are an important language resource: compared with parallel corpora, they are large and easy to obtain. They can be used to train high-quality language models and play an important role in improving the fluency and fidelity of machine translation. To apply Chinese and Burmese monolingual corpora to Chinese-Burmese neural machine translation, this thesis completes the following tasks:

(1) Chinese-Burmese bilingual vocabulary extraction. A Chinese-Burmese bilingual vocabulary is an important resource for machine translation between Chinese and Burmese, but such vocabularies are scarce. This thesis proposes a Chinese-Burmese bilingual vocabulary extraction method that combines topic and context features. First, an LDA topic model is used to obtain the topic distributions of Chinese and Burmese documents, and bilingual word vector representations are used to map the cross-lingual topic vectors into a shared semantic space. Words with high similarity under the same topic are then selected as candidate Chinese-Burmese bilingual vocabulary. Next, BERT is used to obtain semantic representations of the contexts in which the candidate words occur, from which context vectors are constructed. Finally, the candidates are weighted by the similarity of their context vectors to obtain a higher-quality Chinese-Burmese translation vocabulary. Experimental results show that, compared with the bilingual dictionary-based method and the bilingual LDA + CBW-based method, the accuracy of the proposed method improves by 11.07% and 3.82%, respectively.

(2) A corpus selection method for monolingual back-translation based on a neural topic model. Parallel corpora are the basic resource for building a Chinese-Burmese neural machine translation system. Back-translation is an effective way to address the scarcity of parallel data in low-resource translation: the target-side monolingual corpus is translated back into the source language to form pseudo-parallel sentence pairs. Its drawback is that Burmese monolingual data of poor quality or from mixed sources degrades translation performance. To this end, this thesis proposes a monolingual back-translation corpus selection method based on a neural topic model. First, a neural topic model is built from the Burmese side of the collected Chinese-Burmese parallel sentence pairs. The model is then used to select Burmese monolingual text that is related to the training set, which improves the quality of the back-translation corpus for Chinese-Burmese neural machine translation.

(3) A Chinese-Burmese neural machine translation method based on iterative back-translation. Chinese-Burmese neural machine translation requires a large parallel corpus, but Burmese is a low-resource language. Parallel Chinese-Burmese corpora are scarce, whereas large amounts of Chinese and Burmese monolingual text are available on the Internet. This thesis therefore proposes a Chinese-Burmese neural machine translation method based on iterative back-translation. First, back-translation is used to generate source sentences from a large target-side monolingual corpus. The training data is then expanded through dual learning, which addresses the poor generalization of the translation model caused by the scarcity of Chinese-Burmese parallel sentence pairs. Experimental results show that the method improves Chinese-Burmese neural machine translation to a certain extent.

(4) Implementation of a Chinese-Burmese neural machine translation prototype system. Building on the research above, a Chinese-Burmese neural machine translation prototype system that incorporates monolingual corpora was developed with the PyTorch framework, and it provides a visual display of translations. The system consists mainly of a sentence input/output module, a bilingual word embedding module, and a Chinese-Burmese neural machine translation module.
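
The corpus selection step in task (2) can be illustrated with a minimal sketch. The abstract does not say how relatedness to the training set is scored, so everything below is an assumption made for illustration: each sentence is represented by a topic distribution (here plain lists of floats, as a topic model might produce), relatedness is the cosine similarity to the mean topic distribution of the training set, and sentences are kept when they reach a hypothetical `threshold`. The function names are likewise hypothetical and do not come from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_for_back_translation(train_topics, mono_sentences, mono_topics,
                                threshold=0.8):
    """Keep monolingual sentences whose topic distribution is close to the
    training set's average topic distribution.

    Assumption: cosine similarity against a centroid with a fixed threshold
    stands in for whatever selection criterion the thesis actually uses.
    """
    centroid = mean_vector(train_topics)
    selected = []
    for sentence, topics in zip(mono_sentences, mono_topics):
        if cosine(topics, centroid) >= threshold:
            selected.append(sentence)
    return selected


# Toy example with two topics: the training set is mostly about topic 0.
train_topics = [[0.9, 0.1], [0.7, 0.3]]
mono_sentences = ["in-domain sentence", "off-topic sentence"]
mono_topics = [[0.85, 0.15], [0.1, 0.9]]
kept = select_for_back_translation(train_topics, mono_sentences, mono_topics)
```

In this toy run only the first sentence is kept, because its topic distribution points in nearly the same direction as the training-set centroid. In practice the topic distributions would come from the trained topic model, and the threshold would need tuning on held-out data.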

Keywords/Search Tags:

Chinese-Burmese bilingual vocabulary, neural topic model, monolingual corpus, Chinese-Burmese neural machine translation

Related items

1	Research On Chinese-Myanmar Neural Machine Translation Method Integrating Bilingual Dictionary
2	Research On The Construction Method Of Chinese-Myanmar Bilingual Theme Model With Multiple Features
3	Research On The Application Of Chinese-Burmese Bilingual Sentence-level Embedding Semantic Representation Method Based On Neural Network
4	Research On Bilingual Entity Extraction Method Based On Chinese-Burmese Bilingual Corpus
5	Research On The Construction Method Of Chinese-Burmese Parallel Corpus Based On Pivot Language
6	Research On Mongolian And Chinese Machine Translation Based On Monolingual Corpus Training
7	Research And Implementation On Uyghur-Chinese Neural Machine Translation
8	Research On Mongolian-Chinese Neural Machine Translation Based On Monolingual Corpus And Reinforcement Learning
9	Research On Thai-Chinese Machine Translation Optimization Method Under Low Resource Conditions
10	Research On Chinese-Mongolian Neural Machine Translation Based On Monolingual Corpora