Low-Resource Machine Translation Techniques For Distant Language Pairs

Posted on: 2022-06-22

Degree: Master

Type: Thesis

Country: China

Candidate: Z H Zhou

Full Text: PDF

GTID: 2518306725493314

Subject: Computer technology

Abstract/Summary:

As regions of the world become more closely connected, the demand for cross-language communication keeps growing. Machine translation automatically translates sentences by computer, lowering the barrier to cross-language communication, and it is an active research topic in natural language processing. Neural machine translation, which trains a complex neural network on large-scale data, is currently the dominant approach. In low-resource scenarios, however, sufficient parallel corpora are often unavailable, which leads to low-quality translations. Low-resource machine translation techniques use monolingual data to help the translation system in such scenarios, and they improve translation performance for close language pairs. Distant language pairs, however, still pose many difficulties.

From the perspective of translation knowledge acquisition, unsupervised word translation learning performs well enough for relatively close language pairs. For distant language pairs, the gap between the monolingual semantic spaces is large, so it is difficult to learn high-quality bilingual word translations from a small number of alignment signals. From the perspective of translation modeling, mainstream unsupervised translation models do not account for word splitting. Close language pairs share words or subwords, which helps with translating split words; for distant language pairs, the alignment of split words is hard to learn, and split words are translated poorly.

This thesis proposes solutions to these two problems. The main work is as follows.

To address the difficulty of learning word translation relationships in distant language pairs, we propose a bilingual word embedding learning method based on dictionary extraction. The method uses bilingual word embeddings and a statistical word alignment model to extract a dictionary from parallel data. This provides more alignment signals than previous methods, which rely on a small ground-truth dictionary, so it is better suited to distant language pairs. Experimental results show that the proposed method improves word alignment and bilingual word embedding performance at the same time.

To address the problem of split words in distant language pairs, we propose an unsupervised machine translation model that models split words. We add a word representation combiner to model split words, and we propose two training tasks to enhance the combined representation. Experimental results show that the method does model split words and improves their translation quality.
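
The idea of extracting a dictionary from word-aligned parallel data can be illustrated with a minimal sketch. This is not the thesis's method: it assumes alignment links are already available in the common "i-j" (source index-target index) text format that statistical aligners typically emit, and it omits the bilingual word embedding component entirely. The function name `extract_dictionary`, the thresholds `min_count` and `min_prob`, and the toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def extract_dictionary(src_sents, tgt_sents, alignments, min_count=2, min_prob=0.5):
    """Collect word pairs that are aligned often and consistently.

    All names and thresholds here are illustrative assumptions,
    not taken from the thesis.
    """
    pair_counts = Counter()  # (source word, target word) -> number of alignment links
    src_counts = Counter()   # source word -> number of alignment links it takes part in

    for src, tgt, links in zip(src_sents, tgt_sents, alignments):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        for link in links.split():
            i, j = map(int, link.split("-"))
            pair_counts[(src_tokens[i], tgt_tokens[j])] += 1
            src_counts[src_tokens[i]] += 1

    dictionary = {}
    for (s, t), count in pair_counts.items():
        # Keep a pair only if it is frequent enough and accounts for a
        # large enough share of the source word's alignment links.
        if count >= min_count and count / src_counts[s] >= min_prob:
            dictionary.setdefault(s, []).append(t)
    return dictionary


# Toy parallel data with hand-written alignment links (assumed, not real output).
toy_src = ["the cat sleeps", "the dog sleeps", "a cat eats"]
toy_tgt = ["le chat dort", "le chien dort", "un chat mange"]
toy_links = ["0-0 1-1 2-2", "0-0 1-1 2-2", "0-0 1-1 2-2"]

toy_dictionary = extract_dictionary(toy_src, toy_tgt, toy_links)
```

With these toy inputs, only word pairs aligned at least twice survive the filter, so rarely seen pairs are dropped. In a sketch like this the thresholds trade dictionary size against noise; the extracted pairs could then serve as alignment signals for learning bilingual word embeddings.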

Keywords/Search Tags:

Low-Resource Machine Translation, Unsupervised Machine Translation, Cross-Lingual Word Embedding, Word Alignment, Tokenization

Related items

1	Research On Unsupervised Neural Machine Translation
2	Research On Machine Reading Comprehension Model Based On Cross-lingual Transfer Technology
3	Research On Sentence Alignment Method Based On Cross-lingual Word Embeddings
4	Research On The Application Of Machine Translation In Cross-lingual Document Classification
5	Research On Mongolian And Chinese Machine Translation Based On Monolingual Corpus Training
6	Study On Word Alignment Technology And Construction Of Statistical Machine Translation Platform
7	Research On End-to-end Neural Network Machine Translation
8	Design And Implementation Of Heuristic Analogy Translation Mechanism In IHSMTS
9	Research On Word Alignment In Statistical Machine Translation
10	Research On Chinese Word Segmentation Strategies For Statistical Machine Translation