Low-Resource Machine Translation Techniques For Distant Language Pair

Posted on: 2022-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Zhou
Full Text: PDF
GTID: 2518306725493314
Subject: Computer technology
Abstract/Summary:
As regions of the world grow more closely connected, demand for cross-lingual communication keeps rising. Machine translation automatically translates sentences by computer, which lowers the barrier to cross-lingual communication and makes it an active research topic in natural language processing. Neural machine translation, which trains a complex neural network on large-scale data, has recently become the dominant approach. In low-resource scenarios, however, sufficient parallel corpora are often unavailable, which leads to low-quality translations. Low-resource machine translation techniques use monolingual data to help translation systems in such scenarios, and they improve translation performance for close language pairs. Distant language pairs still pose many difficulties.

From the perspective of translation knowledge acquisition, unsupervised word translation learning performs well enough for relatively close language pairs. For distant language pairs, the gap between the monolingual semantic spaces is large, so it is difficult to learn high-quality bilingual word translations from a small number of alignment signals. From the perspective of translation modeling, mainstream unsupervised translation models do not account for words being split into subwords. Close language pairs share words or subwords, which helps the translation of split words. For distant language pairs, the alignment of split words is hard to learn and their translation quality is poor.

This thesis proposes solutions to these two problems. The main work is as follows:

To address the difficulty of learning word translation relationships in distant language pairs, we propose a bilingual word embedding learning method based on dictionary extraction. The method uses bilingual word embeddings and a statistical word alignment model to extract a dictionary from parallel data. This provides more alignment signals than previous methods, which rely on a small ground-truth dictionary, so the method is better suited to distant language pairs. Experimental results show that the proposed method improves word alignment and bilingual word embedding performance at the same time.

To address split words in distant language pairs, we propose an unsupervised machine translation model that explicitly models split words. We add a word representation combiner to model split words, and we propose two training tasks to enhance the combined representation. Experimental results show that our method does model split words and improves their translation quality.
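
As a rough illustration of extracting a dictionary from parallel data with a statistical word alignment model, the sketch below trains IBM Model 1 with EM on a toy corpus and keeps each source word's most probable translation. This is not the thesis's method: the abstract does not name the alignment model, and the bilingual word embedding component is left out entirely. The choice of IBM Model 1, the function names `train_ibm1` and `extract_dictionary`, the threshold, and the toy German-English sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate P(target_word | source_word) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (source_word, target_word) to a probability.
    """
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(tgt_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (source, target) links
        total = defaultdict(float)  # expected count of links per source word
        for src, tgt in pairs:
            for t in tgt:
                # E-step: distribute one unit of mass for t over source words.
                norm = sum(prob[(s, t)] for s in src)
                for s in src:
                    frac = prob[(s, t)] / norm
                    count[(s, t)] += frac
                    total[s] += frac
        # M-step: renormalise expected counts per source word.
        prob = defaultdict(
            float, {(s, t): c / total[s] for (s, t), c in count.items()}
        )
    return dict(prob)


def extract_dictionary(prob, threshold=0.5):
    """Keep each source word's best target word if it is confident enough."""
    best = {}
    for (s, t), p in prob.items():
        if p >= threshold and p > best.get(s, (None, 0.0))[1]:
            best[s] = (t, p)
    return {s: t for s, (t, _) in best.items()}


# Hypothetical toy parallel corpus, used only to exercise the code.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

translation_probs = train_ibm1(corpus)
dictionary = extract_dictionary(translation_probs)
print(dictionary)
```

Under these assumptions, the extracted word pairs would act as the extra alignment signals for bilingual word embedding learning, in place of a small hand-built seed dictionary. The threshold trades dictionary size against precision.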
Keywords/Search Tags: Low-Resource Machine Translation, Unsupervised Machine Translation, Cross-Lingual Word Embedding, Word Alignment, Tokenization