Domain Adaptation For Statistical Machine Translation

Posted on:2016-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2308330464452158

Subject:Management Science and Engineering

Abstract/Summary:

Statistical machine translation (SMT) is an automatic translation technology that relies heavily on large-scale parallel corpora: it learns translation knowledge from a parallel corpus and trains models to carry out a specific translation task. Currently, most SMT systems trained on large-scale mixed-domain parallel corpora perform well when translating text from a range of domains. However, a general-domain system performs worse when applied to a domain-specific task such as spoken language translation. The reason is that the general-domain system does not adapt its translation model to the test set using domain-specific knowledge and target-domain expressions, which degrades translation performance. We therefore study domain adaptation for SMT and propose novel methods to address this problem. The research covers three topics:

1) General-domain parallel corpus construction

Parallel corpus construction collects bilingual translated text, an indispensable resource for domain adaptation in SMT. To build a large-scale parallel corpus, we propose an iterative link-based method for parallel web page identification. The approach combines information internal to a web page with information external to it in order to find parallel web pages within a bilingual website. Experiments show that, compared with the baseline system, the optimized system improves F-score by 6.2% on the test set, demonstrating the effectiveness of the method.

2) In-domain sentence pair selection

Sentence pair selection addresses the lack of sufficient bilingual text for SMT in the domain of interest. It mines the sentence pairs most relevant to the target domain from a general-domain parallel corpus in order to expand the training data of the domain-specific translation model. To this end, we propose three novel methods for selecting domain-relevant sentence pairs, all based on a combination of a translation model and a language model trained on a small-scale parallel corpus. These approaches measure both how well the two sides of a sentence pair translate each other and how relevant the pair is to the domain. Experiments show that our methods outperform previous methods. When the selected sentence pairs are evaluated on an end-to-end machine translation task, our methods increase translation performance by nearly 3 BLEU points.

3) Fusion of general-domain and in-domain translation models

Sentence pair selection mines the top N domain-relevant sentence pairs from a general-domain parallel corpus and then uses the refined data to train a domain-specific SMT system. However, choosing a good value of N is difficult. We therefore also study domain adaptation for SMT at the model level and propose a phrase weighting method that combines the general-domain and in-domain translation models. The approach adjusts the general-domain translation model to match the target-domain task, improving the adaptability of the general-domain translation system. Experiments show that, compared with the baseline systems, the adapted general-domain systems improve by 2 BLEU points on the test set.
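The abstract does not give the scoring formulas behind the sentence pair selection in part 2). As a rough illustration of the language-model side of such a score only, the sketch below ranks sentence pairs by cross-entropy difference, a well-known selection criterion that is swapped in here and is not claimed to be the thesis's method. The function names (`train_unigram_lm`, `relevance_score`, `select_top_n`), the add-one-smoothed unigram model, and the choice to score the source side only are all assumptions made for illustration; the translation-model component described in the abstract is not modeled.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a function giving add-one-smoothed unigram log-probabilities.

    A toy stand-in for a real n-gram language model (assumption).
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy of a sentence under a language model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return -sum(logprob(t) for t in tokens) / len(tokens)


def relevance_score(sentence, in_lm, gen_lm):
    """Cross-entropy difference: lower means closer to the target domain."""
    return cross_entropy(in_lm, sentence) - cross_entropy(gen_lm, sentence)


def select_top_n(pairs, in_domain_src, general_src, n):
    """Keep the n (source, target) pairs whose source side looks most in-domain."""
    in_lm = train_unigram_lm(in_domain_src)
    gen_lm = train_unigram_lm(general_src)
    ranked = sorted(pairs, key=lambda p: relevance_score(p[0], in_lm, gen_lm))
    return ranked[:n]
```

The cutoff `n` plays the role of the N discussed in part 3): the ranking is cheap to compute, but the score alone does not say where to cut, which is the difficulty that motivates combining models instead.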

Keywords/Search Tags:

Statistical Machine Translation, Domain Adaptation, Sentence Pair Selection, Fusion of Translation Models

Related items

1	Domain Adaptation For Statistical Machine Translation
2	Research On Semantics Analysis-based Domain Adaptation Reinforcement Method For Machine Translation
3	Exploring Method Of Domain Adaptation For Statistical Machine Translation
4	Research On Domain Adaptation For Statistical Machine Translation
5	Study On Technology Of Corpus Processing And Its Quality Evaluation For Statistical Machine Translation
6	Domain Adaptation For Statistical Machine Translation
7	Research On Domain Adaptation In Statistical Machine Translation Based On Clustering
8	Research On Some Key Aspects Of Statistical Machine Translation
9	Exploring Method Of The Construction Of Parallel Corpus For Machine Translation In A Specific Domain
10	Continuous-Space Based Statistical Machine Translation