Domain Adaptation For Statistical Machine Translation

Posted on: 2017-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2308330488961975
Subject: Computer Science and Technology
Abstract/Summary:
Statistical Machine Translation (SMT) is an automatic translation technology that relies heavily on large-scale parallel corpora. It translates by statistical analysis, which involves building a translation model, a language model, and a reordering model. Currently, the performance of an SMT system depends mainly on the scale and quality of its training corpus. In general, a large, high-quality corpus leads to satisfactory translations because it contains rich translation knowledge and covers a wide range of linguistic phenomena. However, the performance of a domain-specific SMT system degrades when high-quality parallel corpora are lacking in the domain of interest, because the system cannot adapt to domain-specific translation knowledge and linguistic phenomena. We therefore study domain adaptation for SMT and propose novel methods in the following three areas:

1) Specific-domain parallel corpus construction
Based on the observation that domain-specific bilingual websites tend to contain large amounts of parallel or comparable bilingual text, we propose a novel method for identifying such websites. The method identifies them automatically through global retrieval followed by local classification, and optimizes the identification process with respect to both recall and precision. In experiments on the electronic devices domain, global retrieval yields a total of 18,944 websites. The local classification is based on 3,000 samples randomly extracted from these websites and manually annotated, and achieves an F1-measure of 85.19%.
Additionally, we expand the training set of a specific-domain translation system with bilingual corpora extracted from the identified websites and obtain promising results, which verifies the effectiveness of our method.

2) Domain-relevant sentence pair selection
Domain-relevant sentence pair selection is an effective way to address the lack of high-quality bilingual corpora for SMT in the domain of interest. Supervised by the prior bilingual knowledge in small-scale in-domain training data, it automatically identifies and extracts domain-relevant sentence pairs from a large-scale general-domain corpus. The selected sentence pairs are then added to the SMT training data. In this thesis, we propose incorporating topic information into sentence pair selection. Specifically, we propose a topic-based ranking model that introduces underlying semantic information from a topic perspective and combines two parts. First, a bilingual topic distribution is used to discover the underlying semantic information of a sentence pair. Second, a transition probability is proposed to associate each topic with the target domain. When the selected sentence pairs are evaluated on an end-to-end SMT task, our method improves translation performance by nearly 1.64 BLEU points.

3) Translation model optimization based on specific-domain features
The methods above achieve promising results for domain adaptation in SMT. In this part, we further approach the problem from the perspective of the translation model. We propose a novel method that exploits domain-specific translation knowledge at the phrase-rule level, and we incorporate the optimized translation model into a traditional SMT system. Concretely, we propose a Convolutional Neural Network (CNN) based method for the optimization. First, it estimates a domain relevance score for each sentence pair in the training corpus.
Second, we use the score to re-estimate the translation probabilities of the phrase pairs extracted from the training corpus. Finally, we propose a linear fusion method to combine the general-domain and in-domain translation models. Experiments show that, compared with the baseline system, our method improves translation performance by nearly 2.9 BLEU points.

In conclusion, this thesis studies domain adaptation for SMT systems and proposes novel methods from three perspectives: domain-specific parallel corpus construction (identification of domain-specific bilingual websites based on global retrieval and local classification), domain-relevant sentence pair selection (adaptation data selection with bilingual topic information), and translation model optimization (CNN-based translation model optimization with domain-specific features). Experiments show that our methods yield promising results on end-to-end domain-specific SMT tasks.
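
The re-estimation and linear fusion steps of part 3) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption: relevance scores are taken as given numbers in [0, 1] (the CNN that would produce them is not modeled), re-estimation is shown as relevance-weighted relative frequency, and fusion is shown as plain linear interpolation with a hypothetical weight `lam`. The function names `weighted_phrase_table` and `linear_fusion` are illustrative only.

```python
from collections import defaultdict


def weighted_phrase_table(extractions):
    """Estimate p(target | source) from relevance-weighted phrase-pair counts.

    `extractions` is an iterable of (source_phrase, target_phrase, relevance)
    tuples, one per extracted phrase-pair occurrence. `relevance` is the
    domain relevance score of the sentence pair the occurrence came from
    (assumed here to be a non-negative float supplied by some scorer).
    """
    pair_mass = defaultdict(float)  # weighted count of each (source, target)
    src_mass = defaultdict(float)   # weighted count of each source phrase
    for src, tgt, relevance in extractions:
        pair_mass[(src, tgt)] += relevance
        src_mass[src] += relevance
    # Relative frequency over weighted counts; skip sources with zero mass.
    return {
        (src, tgt): mass / src_mass[src]
        for (src, tgt), mass in pair_mass.items()
        if src_mass[src] > 0
    }


def linear_fusion(p_in, p_gen, lam=0.5):
    """Linearly interpolate an in-domain and a general-domain phrase table.

    Phrase pairs missing from one table contribute probability 0.0 from it.
    `lam` is a hypothetical interpolation weight for the in-domain table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    keys = set(p_in) | set(p_gen)
    return {
        key: lam * p_in.get(key, 0.0) + (1.0 - lam) * p_gen.get(key, 0.0)
        for key in keys
    }


# Toy usage with made-up phrases and scores (illustrative data only).
in_domain = weighted_phrase_table([
    ("s", "t1", 0.9),
    ("s", "t2", 0.1),
    ("s", "t1", 0.5),
])
general = {("s", "t1"): 0.4, ("s", "t3"): 0.6}
fused = linear_fusion(in_domain, general, lam=0.5)
```

In this sketch, phrase pairs that occur mostly in sentence pairs with high relevance scores receive more probability mass in the in-domain table, and `lam` controls how strongly that table outweighs the general-domain one. In practice such a weight would be tuned on in-domain development data.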
Keywords/Search Tags: Statistical Machine Translation, Domain Adaptation, Domain Specific Bilingual Websites, Sentence Pair Selection, Translation Model Optimization