
Methods for Handling OOV in Chinese-Uyghur Neural Machine Translation

Posted on: 2019-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: L N G E M H M T Gu
GTID: 2428330566966986
Subject: Information and Communication Engineering
Abstract/Summary:
Existing research has mainly focused on statistical methods. Recently, neural machine translation (NMT) has achieved considerable results on multiple language pairs, and its performance has even surpassed that of traditional statistical machine translation. However, NMT is prone to out-of-vocabulary (OOV) problems because of its limited vocabulary. An agglutinative language such as Uyghur, with its rich morphological system, theoretically has an unlimited vocabulary and therefore faces more serious OOV problems in NMT. This thesis therefore studies RNN-based Chinese-Uyghur neural machine translation techniques for handling OOV words.

In terms of data, this thesis manually collects and constructs a written-language and a spoken-language Chinese-Uyghur parallel corpus. In terms of platforms, it builds a statistical machine translation platform based on Moses and a neural machine translation platform based on TensorFlow. In terms of models, it proposes a memory-augmented neural machine translation model. In terms of experiments, three groups of comparative experiments are conducted to verify the feasibility of the proposed method and of the ideas for reducing OOV words:

(1) Chinese-Uyghur machine translation experiments are conducted on three different systems: phrase-based statistical machine translation (PBMT) using Moses, and attention-based neural machine translation (NMT) and memory-augmented neural machine translation (M-NMT) using TensorFlow. The BLEU scores rank PBMT (30.46) < NMT (32.40) < M-NMT (34.20). From PBMT to NMT only the fluency of the translations improves, whereas from NMT to M-NMT both adequacy and fluency improve. The OOV counts in the results rank NMT (1590) > M-NMT (1443) > PBMT (569). This shows that the proposed M-NMT model not only translates better but can also alleviate the OOV problem that the limited vocabulary aggravates in NMT.

(2) A Chinese-Uyghur M-NMT experiment on partially segmented data is performed. To further handle OOV words, the low-frequency Uyghur words in the corpus are segmented into the form "stem + affix". Chinese-Uyghur M-NMT comparison experiments are then run on the original data and on the segmented data under the same parameter settings. Although partial segmentation increases the computational complexity and requires more system memory, the experiments show that it can (a) reduce the size of the vocabulary, (b) handle the OOV problem, and (c) improve the translation results.

(3) A similar-word replacement experiment based on NMT is conducted. This thesis attempts to use the open-source word2vec tool to train word vectors on the Chinese side of the corpus. OOV words in the Chinese-Uyghur M-NMT results are then reduced by replacing each test-set word that is not in the NMT vocabulary with a similar word that is in the NMT vocabulary. The experimental results indicate that the substitution idea is indeed effective.
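The similar-word replacement idea in experiment (3) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual procedure, so everything here is an assumption made for illustration: the function names `cosine` and `replace_oov` are hypothetical, the similarity measure is plain cosine similarity, and the three-dimensional toy vectors with English placeholder tokens stand in for word vectors that would in practice be trained with word2vec on the Chinese side of the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_oov(tokens, nmt_vocab, vectors):
    """Replace each source token outside the NMT vocabulary with its most
    similar in-vocabulary word. Tokens without a word vector are left
    unchanged, since no similarity can be computed for them."""
    candidates = [w for w in nmt_vocab if w in vectors]
    replaced = []
    for tok in tokens:
        if tok in nmt_vocab or tok not in vectors or not candidates:
            replaced.append(tok)
            continue
        best = max(candidates, key=lambda w: cosine(vectors[tok], vectors[w]))
        replaced.append(best)
    return replaced


# Toy data (made up for illustration; not taken from the thesis).
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "river": [0.0, 0.2, 0.9],
    "drives": [0.3, 0.9, 0.1],
}
toy_nmt_vocab = {"car", "river", "drives"}

sentence = ["automobile", "drives", "unknownword"]
result = replace_oov(sentence, toy_nmt_vocab, toy_vectors)
print(result)
```

In this toy run, "automobile" is outside the NMT vocabulary but has a vector, so it is mapped to its nearest in-vocabulary neighbour, while "unknownword" has no vector and passes through unchanged. A real pipeline would apply the replacement to the source sentences of the test set before decoding.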
Keywords/Search Tags:Chinese-Uyghur machine translation, OOV, statistical machine translation, neural machine translation, memory-augmented neural machine translation, stem affixes, word2vec