In statistical machine translation (SMT), the domain of the data has a significant influence on the performance of a translation system. When the training data and the test data come from the same or a similar domain, the SMT system performs well; otherwise, translation quality degrades.

With the development of the Internet, more and more bilingual sentences are collected automatically from the World Wide Web. An important attribute of this data is that the domain information is not given ahead of time. In this paper, we address the domain adaptation of an SMT system under this scenario.

First, we investigate mining the topic distribution of a bilingual corpus. Based on the latent Dirichlet allocation (LDA) model, we propose two topic models that incorporate knowledge from both languages: bilingual LDA and projected-LDA. In these two models, each topic is viewed as a domain, and we obtain the topic distribution of each sentence pair in the corpus. This can be thought of as a soft clustering of the bilingual corpus.

Second, we study the adaptation of word alignment. Building on the traditional word alignment model, we integrate domain information into the word alignment training procedure to obtain a domain-specific alignment result. We build translation models on this alignment, and experimental results show that this improves both word alignment quality and the performance of the SMT system.

Finally, having built translation models for different domains, we propose a multi-model decoding strategy: given a test sentence, its topic distribution is first inferred, and then the most similar translation model is chosen. Experimental results show that this strategy yields a performance improvement, achieving the goal of this work.
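The multi-model decoding strategy selects, for each test sentence, the domain-specific translation model whose topic distribution is most similar to the sentence's. A minimal sketch of this selection step, assuming cosine similarity as the similarity measure (the abstract does not name one) and hypothetical function names:

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions (same length)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def select_model(sentence_topics, model_topics):
    """Return the index of the domain model whose topic distribution
    is most similar to the test sentence's inferred distribution."""
    return max(range(len(model_topics)),
               key=lambda i: cosine_similarity(sentence_topics, model_topics[i]))

# Example: a sentence dominated by topic 0 picks the model whose
# distribution also concentrates on topic 0.
sentence = [0.7, 0.2, 0.1]            # inferred by the (bilingual) LDA model
models = [[0.1, 0.8, 0.1],            # one topic distribution per domain model
          [0.6, 0.3, 0.1],
          [0.2, 0.2, 0.6]]
best = select_model(sentence, models)  # index of the chosen translation model
```

In a real system the sentence's distribution would come from LDA inference and each model's distribution from its training subcorpus; other measures (e.g. Hellinger distance) could replace cosine similarity without changing the overall scheme.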