
Research on Unsupervised Domain Adaptation of Mongolian-Chinese Machine Translation Model Based on Fine-Tuning

Posted on: 2022-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: T T Fan
Full Text: PDF
GTID: 2518306542976569
Subject: Master of Engineering

Abstract/Summary:
Machine translation is one of the most important research directions in natural language processing: it studies automatic translation methods that enable a computer to translate one natural language into another. As deep learning has advanced and neural networks have been applied more broadly and deeply to machine translation, the performance of neural machine translation models has improved steadily. At present, however, high-performing neural machine translation is largely confined to language pairs with rich data, because as a data-driven method its performance depends heavily on the quality and scale of the corpus. Language pairs with sparse data need a different route, and research on such pairs has so far focused mainly on monolingual corpora, transfer learning, and related methods. In addition, it is common in practical applications for the marginal distributions of the training data and the test data to differ. To alleviate both the sparsity of the corpus and this distribution mismatch, this thesis studies an unsupervised domain-adaptive Mongolian-Chinese neural machine translation model based on fine-tuning.

Firstly, to address the lack of an in-domain Mongolian-Chinese corpus, the thesis trains a domain-aware feature embedding model on in-domain Mongolian-Chinese language pairs and out-of-domain English-Chinese language pairs to obtain parameters for in-domain Mongolian-Chinese translation. After these parameters are transferred to a child model and used to initialize it, the child model translates in-domain Chinese sentences into in-domain Mongolian sentences, generating an in-domain parallel corpus for the subsequent training.

Secondly, the corpus is segmented at the subword level. Mongolian is an agglutinative language in which new words are usually formed by attaching affixes to a stem, so the thesis adopts subword segmentation and divides words into character combinations. This granularity, between the word level and the character level, not only effectively reduces the number of unknown words and the vocabulary size, but also preserves the semantic characteristics of words as far as possible; a sketch of this step follows.
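The abstract does not name a segmentation toolkit. Below is a minimal sketch using the subword-nmt implementation of byte-pair encoding (BPE), a common choice for splitting words into frequent character combinations; the file names and the merge count of 10,000 are illustrative assumptions, not values taken from the thesis.

# Minimal BPE subword-segmentation sketch with the subword-nmt package
# (pip install subword-nmt). File names and num_symbols are illustrative;
# the thesis does not specify a toolkit or its hyperparameters.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn merge operations from the tokenized Mongolian training side.
with open("train.mn", encoding="utf-8") as infile, \
        open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned codes: each word is split into frequent character
# combinations, with "@@" marking points where a word was divided.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("train.mn", encoding="utf-8") as src, \
        open("train.bpe.mn", "w", encoding="utf-8") as out:
    for line in src:
        out.write(bpe.segment(line.strip()) + "\n")

Because the resulting units sit between word and character level, rare inflected Mongolian forms tend to decompose into known stems and affixes rather than becoming unknown words, which matches the behaviour described above.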
Next, to enlarge the training data and relieve the over-fitting induced by a small dataset, the thesis uses a data selection method. Based on the in-domain and out-of-domain corpora, a similarity score is computed for each sentence in the out-of-domain corpus, and sentences whose scores fall below a threshold are selected and introduced into the training of the in-domain model. Several thresholds are tried to find the one that yields the highest BLEU, and the experiments show that data selection improves the performance of the translation model and raises BLEU.

Finally, after the threshold-selected corpus is sorted by similarity score, a curriculum learning strategy is used to train the model: sentences are selected according to a certain probability and fed to the model in shards. Translation results are evaluated with BLEU, the mainstream machine translation metric, and the experiments show that the curriculum learning strategy improves the translation performance of the model to some extent. A combined sketch of the selection and scheduling steps appears below.
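The abstract specifies neither the similarity metric nor the sampling scheme, only that out-of-domain sentences scoring below a threshold are kept (so the score behaves like a distance from the in-domain data), sorted, and fed in shards with a certain probability. The following is a minimal sketch under those assumptions: similarity_score is a placeholder (a common concrete choice is the Moore-Lewis cross-entropy difference, where lower means more in-domain), and the threshold, shard count, and sampling probability are invented for illustration.

import random
from typing import Callable, Iterator, List, Tuple

def select_and_sort(out_of_domain: List[str],
                    similarity_score: Callable[[str], float],
                    threshold: float) -> List[Tuple[float, str]]:
    # Keep sentences scoring below the threshold (lower = more in-domain,
    # following the abstract) and sort them from most to least in-domain.
    scored = [(similarity_score(s), s) for s in out_of_domain]
    kept = [pair for pair in scored if pair[0] < threshold]
    kept.sort(key=lambda pair: pair[0])
    return kept

def curriculum_shards(sorted_data: List[Tuple[float, str]],
                      num_shards: int,
                      sample_prob: float,
                      seed: int = 0) -> Iterator[List[str]]:
    # Yield shards in curriculum order (most in-domain first), keeping each
    # sentence with a fixed probability -- one reading of the abstract's
    # "selected according to a certain probability".
    rng = random.Random(seed)
    shard_size = max(1, len(sorted_data) // num_shards)
    for start in range(0, len(sorted_data), shard_size):
        yield [s for _, s in sorted_data[start:start + shard_size]
               if rng.random() < sample_prob]

# Illustrative usage with a dummy scorer; in practice the score would come
# from in-domain vs. out-of-domain language models or sentence embeddings.
if __name__ == "__main__":
    rng = random.Random(1)
    corpus = ["sentence %d" % i for i in range(1000)]
    selected = select_and_sort(corpus, lambda s: rng.random(), threshold=0.5)
    for shard in curriculum_shards(selected, num_shards=4, sample_prob=0.8):
        pass  # each shard would be fed to fine-tune the NMT model in turn

Training then proceeds shard by shard, so the model sees the most in-domain sentences first; the abstract credits this schedule with the final improvement in BLEU.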
Keywords/Search Tags:Mongolian-Chinese Machine Translation, Fine Tuning, Domain Adaptation, Similarity, Curriculum Learning