In statistical machine translation (SMT), the domain of the data has a significant influence on the performance of a translation system. When the training data and the test data come from the same or a similar domain, the SMT system performs well; otherwise, translation quality degrades.

With the development of the Internet, more and more bilingual sentences are collected automatically from the World Wide Web. An important attribute of this data is that the domain information is not given ahead of time. In this paper, we address the domain adaptation of an SMT system under this scenario.

First, we investigate mining the topic distribution of a bilingual corpus. Based on the latent Dirichlet allocation (LDA) model, we propose two topic models that incorporate knowledge from both languages: bilingual LDA and projected-LDA. In these two models, each topic is viewed as a domain, and we obtain the topic distribution of each sentence pair in the corpus. This can be thought of as a soft clustering of the bilingual corpus.

Second, we study the adaptation of word alignment. Building on the traditional word alignment model, we integrate domain information into the word alignment training procedure to obtain a domain-specific alignment result. We build translation models on this alignment, and experimental results show that this improves both word alignment quality and the performance of the SMT system.

Finally, having built translation models for different domains, we propose a multi-model decoding strategy: given a test sentence, its topic distribution is first inferred, and then the most similar translation model is chosen. Experimental results show that this strategy yields a performance improvement, achieving the goal of this work.
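The multi-model decoding strategy selects, for each test sentence, the domain-specific translation model whose topic distribution is most similar to the sentence's. A minimal sketch of this selection step, assuming cosine similarity as the similarity measure (the abstract does not name one) and hypothetical function names:

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions (same length)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def select_model(sentence_topics, model_topics):
    """Return the index of the domain model whose topic distribution
    is most similar to the test sentence's inferred distribution."""
    return max(range(len(model_topics)),
               key=lambda i: cosine_similarity(sentence_topics, model_topics[i]))

# Example: a sentence dominated by topic 0 picks the model whose
# distribution also concentrates on topic 0.
sentence = [0.7, 0.2, 0.1]            # inferred by the (bilingual) LDA model
models = [[0.1, 0.8, 0.1],            # one topic distribution per domain model
          [0.6, 0.3, 0.1],
          [0.2, 0.2, 0.6]]
best = select_model(sentence, models)  # index of the chosen translation model
```

In a real system the sentence's distribution would come from LDA inference and each model's distribution from its training subcorpus; other measures (e.g. Hellinger distance) could replace cosine similarity without changing the overall scheme.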