In recent years, neural machine translation (NMT) has become the dominant paradigm in machine translation due to its outstanding performance. However, the translation quality of NMT models depends heavily on the quantity and quality of parallel corpora. In particular translation scenarios, such as long sentence translation and noisy sentence translation, the performance of NMT models drops significantly. Data augmentation methods have been widely applied in general machine translation settings to mitigate the shortage of parallel corpora and have achieved remarkable results, but research on data augmentation for these specific NMT scenarios is still lacking. This thesis investigates data augmentation methods for two translation scenarios, long sentence translation and noisy sentence translation, and additionally proposes improvements to the commonly used back-translation method to enhance the translation quality of NMT models. The main research contents of this thesis are as follows:

(1) To address the poor translation quality of long sentences in NMT, this thesis proposes a data augmentation method based on Random Sentence Concatenation. The main reason NMT models translate long sentences poorly is that the training corpus contains too few long sentences, so the models are insufficiently trained on long inputs. To tackle this problem, this thesis proposes a rule-based data augmentation method that generates synthetic mixed-language corpora by randomly concatenating source-side and target-side sentences, thereby expanding the original dataset. Experiments were conducted in both supervised and semi-supervised translation settings: in the supervised setting the proposed method improves the translation quality of long and short sentences simultaneously, and in the semi-supervised setting combining it with back-translation further improves translation quality. In addition, the attention matrices before and after applying the method were visualized; the analysis shows that the method helps the model focus on the important parts of the source sentence, thereby improving its cross-lingual translation ability. A minimal sketch of the concatenation step is given after item (2) below.

(2) To address the poor translation quality of noisy sentences in NMT, this thesis proposes a data augmentation method based on Discrete Augmented Data Mixing. Existing NMT models are very sensitive to noisy data: even a small amount of noise in the input sentence can significantly reduce translation quality. To solve this problem, this thesis proposes a noise-based data augmentation method that synthesizes new training data by linearly interpolating the word embeddings of a discretely perturbed sentence sequence with those of the original sentence sequence. Experiments on Chinese-English, German-English, and English-German datasets of different sizes show that the proposed method not only improves translation performance but also enhances the model's robustness. Further experiments in the semi-supervised setting show that combining the method with back-translation improves translation performance even more; the mixing step is also sketched below.
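The abstract does not fix the exact concatenation rule of the Random Sentence Concatenation method, so the following Python sketch is only one plausible reading: two sentence pairs are sampled at random and concatenated, and with some probability the second source-side segment is replaced by its target-side counterpart to produce a mixed-language source sequence. The function name, the mix_prob parameter, and the sampling scheme are illustrative assumptions, not details taken from the thesis.

```python
import random

def concat_augment(pairs, num_synthetic, mix_prob=0.5, seed=0):
    """Create longer synthetic training pairs by concatenating sentence pairs.

    pairs: list of (source_sentence, target_sentence) strings.
    With probability mix_prob the second source segment is replaced by its
    target-side counterpart, yielding a mixed-language source sequence
    (an assumed reading of the method, not the thesis' exact rule).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(num_synthetic):
        (src_a, tgt_a), (src_b, tgt_b) = rng.sample(pairs, 2)
        if rng.random() < mix_prob:
            # mixed-language source: second segment is already in the target language
            new_src = f"{src_a} {tgt_b}"
        else:
            # plain concatenation of two source-side sentences
            new_src = f"{src_a} {src_b}"
        new_tgt = f"{tgt_a} {tgt_b}"
        synthetic.append((new_src, new_tgt))
    return pairs + synthetic
```

The augmented list would then be shuffled and used as ordinary training data, with the synthetic long pairs compensating for the scarcity of long sentences in the original corpus.

For Discrete Augmented Data Mixing, the description suggests a mixup-style interpolation at the embedding level. The sketch below, written against PyTorch, assumes that the discrete perturbation is random token replacement with an unknown-word token and that the mixing coefficient is drawn from a Beta distribution; both choices, along with all names and hyperparameters, are assumptions rather than the thesis' actual implementation.

```python
import torch

def mix_embeddings(token_ids, embedding, unk_id, swap_prob=0.1, alpha=4.0):
    """Interpolate the embeddings of an original sentence and a discretely
    perturbed copy (illustrative sketch; all hyperparameters are assumptions).

    token_ids: LongTensor of shape (batch, seq_len), already padded.
    embedding: torch.nn.Embedding shared with the NMT encoder.
    """
    # 1. Discrete perturbation: randomly replace tokens, here with <unk>
    #    (word dropout or swaps with random vocabulary items would also fit
    #    the abstract's description).
    noise_mask = torch.rand(token_ids.shape, device=token_ids.device) < swap_prob
    noisy_ids = torch.where(noise_mask, torch.full_like(token_ids, unk_id), token_ids)

    # 2. Look up embeddings for the original and the perturbed sequence.
    clean_emb = embedding(token_ids)   # (batch, seq_len, dim)
    noisy_emb = embedding(noisy_ids)

    # 3. Mixup-style linear interpolation between the two embedding sequences.
    lam = torch.distributions.Beta(alpha, alpha).sample((token_ids.size(0), 1, 1))
    lam = lam.to(clean_emb.device)
    return lam * clean_emb + (1.0 - lam) * noisy_emb
```

The returned mixed embeddings would replace the standard embedding lookup at the encoder input, while the training target remains the original reference translation.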
(3) To address the limitations of back-translation, this thesis proposes a data augmentation method based on inter-layer knowledge distillation. Back-translation is a typical data augmentation method widely used in semi-supervised scenarios, but its effectiveness depends heavily on the quality of the pseudo-parallel corpora generated by the reverse model. When parallel data is scarce, the pseudo-parallel data generated by the back-translation model may contain a large number of translation errors, and training on this pseudo-parallel data may compromise the model's performance. To address this issue, this thesis proposes a new back-translation-based data augmentation method that follows the back-translation process but improves the quality of the pseudo-parallel corpus by optimizing the model structure. The main idea is to use the rich semantic information in the deep layers of the model to guide the shallow layers, enabling the shallow layers to capture the semantic information in a sentence more accurately and thus improving the overall translation quality. Experimental results on Chinese-English and German-English datasets demonstrate the superiority of this method over standard back-translation; a sketch of the distillation constraint is given below.
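The abstract describes the inter-layer distillation only at a high level (deep layers guiding shallow ones), so the following sketch shows one conventional way to realize such a constraint: an auxiliary loss that pulls a shallow encoder layer's hidden states towards those of a deep layer treated as a fixed teacher. The layer indices, the MSE distance, and the loss weight are illustrative assumptions.

```python
import torch.nn.functional as F

def interlayer_distillation_loss(encoder_states, shallow_idx=2, deep_idx=-1, weight=0.5):
    """Auxiliary loss pushing a shallow encoder layer towards a deep one.

    encoder_states: list of per-layer hidden states, each of shape
    (batch, seq_len, dim), e.g. collected from a Transformer encoder.
    Layer pairing, the MSE distance, and the weight are assumptions.
    """
    student = encoder_states[shallow_idx]
    teacher = encoder_states[deep_idx].detach()  # deep layer acts as a fixed teacher
    return weight * F.mse_loss(student, teacher)
```

In training, a term of this kind would be added to the usual cross-entropy loss of the model that generates pseudo-parallel data, with the intended effect that it captures sentence semantics more accurately and therefore yields higher-quality back-translations.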