
Research On Data Reduction Methods For Neural Machine Translation

Posted on: 2020-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Xu
Full Text: PDF
GTID: 2415330578979437
Subject: Computer technology
Abstract/Summary:
Neural machine translation (NMT) is the current state of the art in machine translation, and it requires large-scale bilingual parallel data as training corpora. Many open bilingual parallel data sets are available, but their quality varies: large data sets suffer from data redundancy, while low-quality data sets contain considerable noise. Both problems increase the training cost of the model and degrade its performance. To reduce the impact of such data problems on NMT, this thesis studies data reduction for neural machine translation from two aspects, data scale and data quality. The main contributions are:

(1) A static data selection method based on sentence embedding. Large-scale bilingual parallel data sets usually contain many parallel sentence pairs with similar semantics, and such pairs make nearly identical contributions to the model: they do not improve its performance, but they do increase its training cost. To remove them, this thesis proposes a static data selection method that shrinks a bilingual parallel data set on the basis of sentence semantics (see the first sketch below). On the United Nations Chinese-English translation task, static data selection reduces training time while matching the performance of a model trained on the full data set.

(2) A dynamic data selection method based on training loss. A defining characteristic of NMT is its need for large-scale bilingual parallel data as training data. Exploiting this characteristic, the thesis proposes a dynamic data selection method based on training loss that gradually shrinks the training set during training (see the second sketch below). On the United Nations Chinese-English translation task, dynamic data selection not only halves the training time of the model but also improves its performance.

(3) Parallel corpus filtering. For the task of filtering noise out of low-quality bilingual parallel data sets, we train a noise classifier in a cross-lingual semantic space to recognize noisy sentence pairs, and we propose to strengthen the classifier by enriching the diversity of its negative samples (see the third sketch below). On the WMT German-English parallel corpus filtering task, an NMT model trained on the filtered German-English corpus achieves better translation performance.
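The first sketch below illustrates the static data selection idea in contribution (1): greedily discard sentence pairs whose embeddings are too similar to pairs already kept. The cosine-similarity threshold, the greedy scan, and the random placeholder embeddings are illustrative assumptions; the thesis does not specify these details, and real embeddings would come from a sentence encoder.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Greedy de-duplication: keep a sentence pair only if its embedding's
    cosine similarity to every already-kept pair is below `threshold`."""
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        # normed[kept] @ vec gives similarities to all kept pairs.
        if not kept or float(np.max(normed[kept] @ vec)) < threshold:
            kept.append(i)
    return kept

# Example: 1000 pairs embedded into a 512-dim space. The random vectors are
# placeholders standing in for sentence-encoder outputs.
subset = select_diverse(np.random.rand(1000, 512), threshold=0.95)
```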
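The second sketch illustrates contribution (2), dynamic data selection based on training loss. The abstract only states that the training set is shrunk gradually during training; the specific criterion here (retain the highest-loss examples, i.e. drop pairs the model already fits well) and the 90%-per-epoch schedule are assumptions for illustration.

```python
import numpy as np

def shrink_by_loss(losses: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of the `keep_ratio` fraction of examples with the
    highest current training loss; well-learned examples are dropped."""
    k = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses)[-k:]  # ascending sort; take the k largest

# Hypothetical schedule: after each epoch, re-score the surviving subset and
# keep 90% of it, so the training set shrinks geometrically.
active = np.arange(100_000)
for epoch in range(5):
    losses = np.random.rand(len(active))  # placeholder for per-pair NMT loss
    active = active[shrink_by_loss(losses, keep_ratio=0.9)]
```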
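The third sketch illustrates contribution (3), a noise classifier trained over cross-lingual sentence embeddings. The logistic-regression model, the concatenated-embedding features, and the permutation-based misaligned negatives are illustrative assumptions; the thesis enriches negative-sample diversity but does not specify the exact classifier or noise types here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_noise_classifier(src_emb: np.ndarray, tgt_emb: np.ndarray,
                           seed: int = 0) -> LogisticRegression:
    """Train a binary classifier that separates aligned sentence pairs
    (label 1) from synthetic noisy pairs (label 0)."""
    rng = np.random.default_rng(seed)
    # One simple way to generate negatives: misalign targets by permutation.
    # A richer negative set (truncations, copies, wrong language) would
    # further diversify the noise the classifier learns to reject.
    neg_tgt = tgt_emb[rng.permutation(len(tgt_emb))]
    X = np.vstack([np.hstack([src_emb, tgt_emb]),   # positives
                   np.hstack([src_emb, neg_tgt])])  # negatives
    y = np.concatenate([np.ones(len(src_emb)), np.zeros(len(src_emb))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```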
Keywords/Search Tags: Neural Machine Translation, Bilingual Parallel Data, Sentence Embedding, Training Loss, Noise Filtering