
Research On Data Reduction Methods For Neural Machine Translation

Posted on: 2020-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Xu
Full Text: PDF
GTID: 2415330578979437
Subject: Computer technology
Abstract/Summary:
Neural machine translation (NMT) is the current state of the art in machine translation, and it requires large-scale bilingual parallel data as training corpora. Many open bilingual parallel data sets are available, but their quality varies: large data sets suffer from data redundancy, while low-quality data sets contain considerable noise. Both problems increase the training cost of the model and degrade its performance. To reduce the impact of such data problems on NMT, this thesis studies data reduction for neural machine translation from two aspects, data scale and data quality. The main contributions are:

(1) A static data selection method based on sentence embedding. Large-scale bilingual parallel data sets usually contain many parallel sentence pairs with similar semantics, and such pairs make nearly identical contributions to the model: they do not improve its performance, but they do increase its training cost. To remove them, this thesis proposes a static data selection method that shrinks a bilingual parallel data set on the basis of sentence semantics (see the first sketch below). On the United Nations Chinese-English translation task, static data selection reduces training time while matching the performance of a model trained on the full data set.

(2) A dynamic data selection method based on training loss. A defining characteristic of NMT is its need for large-scale bilingual parallel data as training data. Exploiting this characteristic, the thesis proposes a dynamic data selection method based on training loss that gradually shrinks the training set during training (see the second sketch below). On the United Nations Chinese-English translation task, dynamic data selection not only halves the training time of the model but also improves its performance.

(3) Parallel corpus filtering. For the task of filtering noise out of low-quality bilingual parallel data sets, we train a noise classifier in a cross-lingual semantic space to recognize noisy sentence pairs, and we propose to strengthen the classifier by enriching the diversity of its negative samples (see the third sketch below). On the WMT German-English parallel corpus filtering task, an NMT model trained on the filtered German-English corpus achieves better translation performance.
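The first sketch below illustrates the static data selection idea in contribution (1): greedily discard sentence pairs whose embeddings are too similar to pairs already kept. The cosine-similarity threshold, the greedy scan, and the random placeholder embeddings are illustrative assumptions; the thesis does not specify these details, and real embeddings would come from a sentence encoder.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Greedy de-duplication: keep a sentence pair only if its embedding's
    cosine similarity to every already-kept pair is below `threshold`."""
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        # normed[kept] @ vec gives similarities to all kept pairs.
        if not kept or float(np.max(normed[kept] @ vec)) < threshold:
            kept.append(i)
    return kept

# Example: 1000 pairs embedded into a 512-dim space. The random vectors are
# placeholders standing in for sentence-encoder outputs.
subset = select_diverse(np.random.rand(1000, 512), threshold=0.95)
```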
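The second sketch illustrates contribution (2), dynamic data selection based on training loss. The abstract only states that the training set is shrunk gradually during training; the specific criterion here (retain the highest-loss examples, i.e. drop pairs the model already fits well) and the 90%-per-epoch schedule are assumptions for illustration.

```python
import numpy as np

def shrink_by_loss(losses: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of the `keep_ratio` fraction of examples with the
    highest current training loss; well-learned examples are dropped."""
    k = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses)[-k:]  # ascending sort; take the k largest

# Hypothetical schedule: after each epoch, re-score the surviving subset and
# keep 90% of it, so the training set shrinks geometrically.
active = np.arange(100_000)
for epoch in range(5):
    losses = np.random.rand(len(active))  # placeholder for per-pair NMT loss
    active = active[shrink_by_loss(losses, keep_ratio=0.9)]
```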
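The third sketch illustrates contribution (3), a noise classifier trained over cross-lingual sentence embeddings. The logistic-regression model, the concatenated-embedding features, and the permutation-based misaligned negatives are illustrative assumptions; the thesis enriches negative-sample diversity but does not specify the exact classifier or noise types here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_noise_classifier(src_emb: np.ndarray, tgt_emb: np.ndarray,
                           seed: int = 0) -> LogisticRegression:
    """Train a binary classifier that separates aligned sentence pairs
    (label 1) from synthetic noisy pairs (label 0)."""
    rng = np.random.default_rng(seed)
    # One simple way to generate negatives: misalign targets by permutation.
    # A richer negative set (truncations, copies, wrong language) would
    # further diversify the noise the classifier learns to reject.
    neg_tgt = tgt_emb[rng.permutation(len(tgt_emb))]
    X = np.vstack([np.hstack([src_emb, tgt_emb]),   # positives
                   np.hstack([src_emb, neg_tgt])])  # negatives
    y = np.concatenate([np.ones(len(src_emb)), np.zeros(len(src_emb))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```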
Keywords/Search Tags: Neural Machine Translation, Bilingual Parallel Data, Sentence Embedding, Training Loss, Noise Filtering