Domain Adaptation For Statistical Machine Translation

Posted on: 2016-06-29  Degree: Master  Type: Thesis
Country: China  Candidate: L Liu  Full Text: PDF
GTID: 2308330464452158  Subject: Management Science and Engineering
Abstract/Summary:
Statistical Machine Translation (SMT) is an automatic translation technology that relies heavily on large-scale parallel corpora: it learns translation knowledge from a parallel corpus and trains models to carry out a specific translation task. Currently, most SMT systems trained on large-scale mixed-domain parallel corpora perform well when translating text from different domains. However, a general-domain system performs worse when it is applied to a domain-specific task such as Spoken Language Translation. This is because the general-domain system does not adapt its translation model to the test set using domain-specific knowledge and target-domain expressions, which degrades translation performance. We therefore study domain adaptation for SMT and propose novel methods to address this problem. The research contents are as follows:

1) General-domain parallel corpus construction
Parallel corpus construction collects bilingual translated text, an indispensable resource for domain adaptation in SMT. To build a large-scale parallel corpus, we propose an iterative link-based method for identifying parallel web pages. The approach combines information internal and external to a web page to find parallel web pages within a bilingual website. Experiments show that, compared with the baseline system, the optimized system improves F-score by 6.2% on the test set, demonstrating the effectiveness of the method.

2) In-domain sentence pair selection
Sentence pair selection addresses the lack of sufficient bilingual text for SMT in the domain of interest. It mines the sentence pairs most relevant to the target domain from the general-domain parallel corpus in order to expand the training data of the domain-specific translation model. To this end, we propose three novel methods for selecting domain-relevant sentence pairs, each based on combining a translation model and a language model trained on a small-scale parallel corpus. These approaches effectively measure both the mutual translation quality and the domain relevance of a sentence pair. Experiments show that our methods outperform previous methods. When the selected sentence pairs are evaluated on an end-to-end machine translation task, our methods improve translation performance by nearly 3 BLEU points.

3) Fusion of general-domain and in-domain translation models
Sentence pair selection mines the top N domain-relevant sentence pairs from the general-domain parallel corpus and then uses the refined data to train a domain-specific SMT system. However, choosing the optimal value of N is a challenging task. We therefore study domain adaptation for SMT at the model level and propose a phrase weighting method to combine the general-domain and in-domain translation models. The approach adjusts the general-domain translation model to match the target-domain task, improving the adaptability of the general-domain translation system. Experiments show that, compared with the baseline systems, the adapted general-domain systems improve by 2 BLEU points on the test set.
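The sentence pair selection idea in 2), scoring each general-domain pair with both a translation model and a language model and keeping the top N, can be illustrated with a minimal sketch. The abstract does not give the three scoring formulas, so everything here is an assumption made for illustration: the add-one-smoothed unigram language model, the Model 1 style lexical score, the equal interpolation weight, and all function names (`train_unigram_lm`, `lm_score`, `tm_score`, `select_top_n`) are hypothetical and are not the thesis's actual methods.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a log-probability function for an add-one-smoothed unigram LM.

    Assumption: a unigram model stands in for the in-domain language model.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob


def lm_score(logprob, sentence):
    """Length-normalized log-probability: a proxy for domain relevance."""
    words = sentence.split()
    if not words:
        return float("-inf")
    return sum(logprob(w) for w in words) / len(words)


def tm_score(lex_table, src, tgt, floor=1e-6):
    """Length-normalized lexical score: a proxy for mutual translation.

    lex_table maps (source_word, target_word) to p(target | source);
    pairs missing from the table get the small probability `floor`.
    """
    src_words, tgt_words = src.split(), tgt.split()
    if not src_words or not tgt_words:
        return float("-inf")
    total = 0.0
    for t in tgt_words:
        best = max(lex_table.get((s, t), floor) for s in src_words)
        total += math.log(best)
    return total / len(tgt_words)


def select_top_n(pairs, src_lm, tgt_lm, lex_table, n, weight=0.5):
    """Rank sentence pairs by a weighted TM + LM score and keep the top n."""
    scored = []
    for src, tgt in pairs:
        lm = 0.5 * (lm_score(src_lm, src) + lm_score(tgt_lm, tgt))
        tm = tm_score(lex_table, src, tgt)
        scored.append((weight * tm + (1 - weight) * lm, (src, tgt)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:n]]


# Toy in-domain data (hypothetical) used to train both models.
in_domain = [("wo xiang he cha", "i want to drink tea"), ("ni hao ma", "how are you")]
src_lm = train_unigram_lm(s for s, _ in in_domain)
tgt_lm = train_unigram_lm(t for _, t in in_domain)
lex_table = {("wo", "i"): 0.9, ("he", "drink"): 0.8, ("cha", "tea"): 0.9}

# Toy general-domain pool: one pair close to the domain, one unrelated.
pool = [
    ("gupiao shichang xiadie", "stock market falls"),
    ("wo he cha", "i drink tea"),
]
selected = select_top_n(pool, src_lm, tgt_lm, lex_table, n=1)
```

In this sketch the cut-off `n` is passed in by hand, which is exactly the parameter that 3) describes as hard to optimize and that motivates the model-level approach.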
Keywords/Search Tags: Statistical Machine Translation, Domain Adaptation, Sentence Pair Selection, Fusion of Translation Models