
Parallel Corpus Generation And Filtering Method For Chinese-Thai Neural Machine Translation

Posted on: 2024-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: A B Chen
Full Text: PDF
GTID: 2568307109988129
Subject: artificial intelligence
Abstract/Summary:
As research on neural machine translation (NMT) has deepened, translation models trained on high-resource language pairs have achieved good results. However, because NMT performance depends on the quality and scale of the training corpus, models for low-resource language pairs perform poorly owing to the scarcity of training data. Expanding the training data has therefore become the main remedy, either by generating pseudo-parallel corpora with models or by extracting large numbers of parallel sentences from bilingual websites, multilingual websites, comparable corpora, and similar sources. Most methods for generating pseudo-parallel corpora are based on back-translation models or related techniques. Because the Chinese-Thai parallel corpus is small, however, these techniques are hampered and the generated pseudo-parallel corpus is of low quality. Parallel corpora obtained from the web, in turn, contain a lot of noise that must be filtered out, and existing filtering methods leave room for improvement. This thesis examines these problems for the Chinese-Thai language pair, with the ultimate aim of improving the performance of Chinese-Thai neural machine translation models. The main contributions are as follows:

1) Research on the generation of Chinese-Thai pseudo-parallel corpora: To address the problem of pseudo-parallel corpus generation, this thesis improves the generative adversarial network model and thereby improves the performance of the back-translation model combined with a pivot language, which effectively raises the quality of the pseudo-parallel corpus. There are two main innovations:
a. The original generative adversarial network has only one discriminator, between the source language and the pivot language. An additional discriminator is added between the source language and the target language, so that the generated pseudo-parallel corpus is closer to the distribution of the training set.
b. During adversarial training, the feedback returned by the discriminator does not fully account for word-level information. A method is therefore proposed that constructs intermediate sentences by part-of-speech substitution instead of the Monte Carlo search algorithm.
The generated pseudo-parallel corpus is then used to train the same translation model, with BLEU as the evaluation metric. The results are 2.04 and 2.09 BLEU higher than the baseline method in the Chinese-Thai and Thai-Chinese directions, respectively.

2) Research on the filtering of noisy parallel corpora: This thesis proposes a word-embedding-based method for filtering noisy parallel corpora:
a. First, monolingual word embeddings for each language are obtained from a pre-trained language model.
b. Then, the two sets of monolingual word embeddings are mapped into the same vector space by means of bilingual word alignment, yielding bilingual word embeddings.
c. Finally, each candidate parallel sentence pair is scored with the bilingual word embeddings, and noise is filtered out according to the score.
The filtered parallel corpus is then used to train the same translation model, with BLEU as the evaluation metric. The results are 1.73 and 0.99 BLEU higher than the baseline method in the Chinese-Thai and Thai-Chinese directions, respectively.

3) Application of pseudo-parallel corpus generation and noisy corpus filtering to Chinese-Thai neural machine translation: This part builds on the two methods above, combining the generated pseudo-parallel corpus with the noise filtering method to further improve the performance of the final trained neural machine translation model. Additional experiments explore how to combine the various corpus expansion schemes more effectively, so as to further improve translation performance on the same amount of data.

4) Design and implementation of a Chinese-Thai neural machine translation prototype system: This part integrates the methods described in the thesis into a prototype Chinese-Thai neural machine translation system consisting of three modules: an input and output module, a sentence preprocessing module, and a neural machine translation module.
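
The final scoring-and-filtering step of contribution 2 can be illustrated with a minimal sketch. The abstract does not state which scoring function is used, so the choices below are assumptions made for illustration only: sentences are represented by the mean of their word vectors, pairs are scored by cosine similarity, and pairs below a fixed threshold are discarded. The function names, the toy embedding tables, and the threshold value are all hypothetical, and the embeddings are assumed to have already been mapped into a shared bilingual space.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the vectors of tokens present in the table; None if none are present."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_pair(zh_tokens, th_tokens, zh_emb, th_emb):
    """Assumed scoring rule: cosine similarity of mean-pooled sentence vectors."""
    zh_vec = mean_vector(zh_tokens, zh_emb)
    th_vec = mean_vector(th_tokens, th_emb)
    if zh_vec is None or th_vec is None:
        return 0.0  # no known words on one side: treat the pair as noise
    return cosine(zh_vec, th_vec)

def filter_pairs(pairs, zh_emb, th_emb, threshold=0.5):
    """Keep only the sentence pairs whose score reaches the (hypothetical) threshold."""
    return [
        (zh, th) for zh, th in pairs
        if score_pair(zh, th, zh_emb, th_emb) >= threshold
    ]

# Toy 2-dimensional embeddings, assumed to be already aligned in one space.
ZH_EMB = {"猫": [1.0, 0.0], "狗": [0.0, 1.0]}
TH_EMB = {"แมว": [0.9, 0.1], "หมา": [0.1, 0.9]}

# One plausible pair (cat / cat) and one mismatched pair (cat / dog).
PAIRS = [(["猫"], ["แมว"]), (["猫"], ["หมา"])]
KEPT = filter_pairs(PAIRS, ZH_EMB, TH_EMB, threshold=0.5)
```

In practice the threshold would be tuned on held-out data, and the input sentences would first need word segmentation for both Chinese and Thai, since neither language separates words with spaces.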
Keywords/Search Tags: Neural machine translation, Noise, Pseudo-parallel corpus, Generative adversarial network, Back-translation