Research On Domain Adaptation In Statistical Machine Translation Based On Clustering

Posted on: 2014-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhang
GTID: 2268330422450584
Subject: Computer Science and Technology
Abstract/Summary:
In statistical machine translation (SMT), the domain of the data has a significant influence on the performance of a translation system. When the training data and the test data come from the same or a similar domain, the SMT system performs well. Otherwise, translation quality degrades.

With the development of the Internet, more and more bilingual sentences are collected automatically from the World Wide Web. An important attribute of this data is that its domain information is not given ahead of time. This work is concerned with the domain adaptation of an SMT system under this scenario.

First, we investigate how to mine the topic distribution of a bilingual corpus. Based on the latent Dirichlet allocation (LDA) model, we propose two topic models that incorporate knowledge from both languages: bilingual LDA and projected-LDA. In both models, each topic is viewed as a domain, and we obtain the topic distribution of each sentence pair in the corpus. This can be thought of as a soft clustering of the bilingual corpus.

Second, we study the adaptation of word alignment. Starting from the traditional word alignment model, we integrate domain information into the word alignment training procedure and obtain domain-specific alignments. We build translation models on these alignments, and the experimental results show that this improves both the quality of the word alignment and the performance of the SMT system.

Finally, having built translation models for the different domains, we propose a multi-model decoding strategy. Given a test sentence, its topic distribution is inferred first, and the most similar translation model is then chosen to translate it. Experimental results show that this strategy improves translation performance, which achieves the goal of this work.
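
The model-selection step of the multi-model decoding strategy can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, and the function names, domain labels, and topic vectors are all hypothetical. It also assumes that each domain-specific translation model is summarised by one topic distribution.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_p * norm_q)


def select_model(sentence_topics, domain_profiles):
    """Return the name of the domain whose topic profile is most similar
    to the topic distribution inferred for the test sentence.

    sentence_topics: list of topic probabilities for one test sentence
    domain_profiles: dict mapping a (hypothetical) domain name to the
                     topic distribution that represents its translation model
    """
    return max(
        domain_profiles,
        key=lambda name: cosine_similarity(sentence_topics, domain_profiles[name]),
    )


# Hypothetical three-topic example; the numbers are illustrative only.
domain_profiles = {
    "news": [0.8, 0.1, 0.1],
    "medical": [0.1, 0.8, 0.1],
    "legal": [0.1, 0.1, 0.8],
}
chosen = select_model([0.2, 0.7, 0.1], domain_profiles)
```

In a full system, the selected name would pick out the domain-specific translation model that is passed to the decoder. Other measures, such as a divergence between the distributions, could be substituted for the cosine.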
Keywords/Search Tags: Statistical machine translation, Topic model, Word alignment adaptation, Multi-model decoding