
Research On Biomedical Domain Adaptation Methods In Neural Machine Translation

Posted on: 2022-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Zhang
GTID: 2480306569959159
Subject: Computer Science and Technology
Abstract/Summary:
Neural machine translation (NMT) models trained on large parallel corpora achieve excellent performance. In specialist domains such as biomedicine, military affairs, and diplomacy, however, parallel corpora contain very few sentence pairs, and NMT models trained on such small data perform poorly. With the development of smart medicine, making better use of international biomedical literature and related results requires domain adaptation methods for biomedical NMT: methods that use general-domain knowledge to improve the training of biomedical translation models and reduce their dependence on biomedical data.

At present, the common domain adaptation method is to pre-train a model on large out-of-domain data and then fine-tune it on small in-domain data. However, pre-training consumes a great deal of computing resources and time, the quality of the out-of-domain data strongly affects the pre-trained model, and fine-tuning on small in-domain data easily leads to overfitting. In response to these problems, this thesis builds a Transformer-based biomedical NMT model, introduces gradual fine-tuning into the pre-training process so that the out-of-domain data is used efficiently, and proposes a dynamic data enhancement training method to strengthen the fine-tuned model. First, data selection based on text classification produces a large data set ranked by domain relevance. The pre-trained model is then trained on this ranked data by gradual fine-tuning, after which dynamic data enhancement training is applied to either the fine-tuned model or the pre-trained model. Experiments show that, compared with the common domain adaptation training method, pre-training with gradual fine-tuning combined with the proposed dynamic data enhancement training shortens training time and improves translation quality: pre-training time is reduced by 28% to 39% relative to conventional pre-training, and BLEU scores on multiple test sets improve by 0.4 to 0.9 points over the conventional domain adaptation model.

Because the biomedical field contains a large amount of professional terminology, Chinese word segmentation tools often produce ambiguous or incorrect segmentations, which in turn cause ambiguities and errors in translation. To address this problem, this thesis proposes a data preprocessing method based on multiple Chinese word segmentation tools. The Chinese side of the biomedical parallel data set is segmented in several ways, and a separate vocabulary is extracted from each segmentation result. From these, a high-frequency biomedical vocabulary is constructed and applied during sub-word segmentation with a biomedical sub-word model. The biomedical data set enhanced with multiple Chinese word segmentations is then regularized with BPE (Byte Pair Encoding) dropout. Experiments show that this method makes the translation model more robust and improves its performance: compared with the dynamic data enhancement model without sub-word optimization, the sub-word-optimized model improves BLEU scores on multiple test sets by 1.3 to 1.5 points.

In addition, this thesis explores several key factors that affect the translation performance of the biomedical domain-adapted translation model: appropriately increasing the number of BPE merge operations, applying the biomedical BPE model to both in-domain and out-of-domain data, and using the biomedical parallel data set as the validation set during both pre-training and fine-tuning. These measures further improve the translation performance of the domain adaptation model.
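
The abstract names BPE dropout as the regularizer but does not give implementation details. The following is a minimal Python sketch of the general technique, not the thesis's own code: during segmentation, each applicable merge is skipped with probability `dropout`, so the same word can be split differently from one training pass to the next. The function name `bpe_dropout_segment`, the merge table `EXAMPLE_MERGES`, and the dropout value used in the demo are all illustrative assumptions.

```python
import random

# Hypothetical toy merge table, ordered from highest to lowest priority.
# A real table would be learned from the biomedical training corpus.
EXAMPLE_MERGES = [("c", "e"), ("l", "l"), ("ce", "ll"), ("cell", "s")]


def bpe_dropout_segment(word, merges, dropout=0.1, rng=None):
    """Segment one word with BPE, dropping each candidate merge with
    probability `dropout` (dropout=0.0 gives ordinary deterministic BPE)."""
    if rng is None:
        rng = random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Collect applicable merges that survive dropout at this step.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and (dropout == 0.0 or rng.random() >= dropout)
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        # Apply the highest-priority surviving merge, leftmost first.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    # Without dropout the segmentation is always the same.
    print(bpe_dropout_segment("cells", EXAMPLE_MERGES, dropout=0.0))
    # With dropout the segmentation can vary from call to call.
    rng = random.Random()
    for _ in range(3):
        print(bpe_dropout_segment("cells", EXAMPLE_MERGES, dropout=0.3, rng=rng))
```

In a training pipeline of this kind, the stochastic segmentation would typically be re-sampled for every epoch while validation and test data are segmented with `dropout=0.0`; whether the thesis follows exactly this arrangement is not stated in the abstract.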
Keywords/Search Tags: biomedicine, domain adaptation, pre-trained model, dynamic data enhancement, sub-word optimization