Exploring Method Of Domain Adaptation For Statistical Machine Translation

Posted on: 2016-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: C Su
GTID: 2298330467472828
Subject: Computer Science and Technology
Abstract/Summary:
Statistical machine translation (SMT), which is built on statistical models, has been regarded as the state of the art. It acquires translation knowledge effectively and allows a well-performing translation system to be set up quickly. However, current SMT systems perform poorly when the domain changes. On the one hand, when an SMT system is applied to Chinese-English patent translation, the domain shift lowers Chinese word segmentation accuracy, which makes it difficult to extract correct translation knowledge. On the other hand, a new domain introduces a large number of new words that the existing translation knowledge does not cover.

To address these problems, which result from domain change, this thesis focuses on domain adaptation for SMT and attempts to improve both the accuracy and the coverage of the extracted translation knowledge. The methods are domain-adaptive Chinese word segmentation for SMT and paraphrasing, both of which aim to improve domain adaptation. The work is presented in two parts.

(1) To address domain adaptation in Chinese word segmentation, we perform segmentation in two ways: by exploiting n-gram statistical features from a raw corpus, and by exploiting bilingually motivated segmentation information from a parallel corpus. We then propose a method based on a linear model that combines the multiple segmentation results, providing effective Chinese word segmentation for SMT across domains. For evaluation, we run Chinese word segmentation and Chinese-English machine translation experiments on the data of the NTCIR-10 Chinese-English patent translation task. The results show that the integrated method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English SMT system.

(2) Extending the phrase table improves the coverage of unknown words from a new domain, but large-scale, high-quality parallel corpora are scarce. We therefore introduce additional paraphrases into SMT to improve domain adaptation. The idea is that, because natural language is diverse, a phrase table covers more meanings than it covers surface forms. An unknown word can therefore be replaced by a paraphrase and receive a proper translation from the phrase table. In this work, we acquire paraphrase knowledge through a third language, represent the multiple paraphrases of an input sentence as a lattice, and modify the SMT decoding algorithm to process the lattice. Experimental results show that, with training sets of different sizes, the proposed systems always outperform the traditional system, and that the proposed approach is robust.

In summary, to improve domain adaptation of SMT, this thesis proposes two methods that optimize SMT at two stages: extracting translation knowledge and decoding with translation knowledge. Experimental results show that the methods improve SMT performance under domain change.
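The paraphrase step in part (2) can be illustrated with a small sketch. The abstract does not give the scoring formula, so the code below assumes the commonly used pivot formulation, in which a candidate paraphrase is scored by summing, over third-language phrases, the probability of translating into the pivot and back. The function names, the toy probability tables, and the word-level lattice (one slot of alternatives per token, a simplification of a general lattice) are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def pivot_paraphrases(src_to_pivot, pivot_to_src, phrase):
    """Score paraphrases of `phrase` by summing over pivot-language phrases.

    Assumed formulation: score(c | phrase) = sum_f p(f | phrase) * p(c | f).
    """
    scores = defaultdict(float)
    for pivot, p_pivot in src_to_pivot.get(phrase, {}).items():
        for candidate, p_back in pivot_to_src.get(pivot, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_pivot * p_back
    return dict(scores)


def build_lattice(tokens, known_vocab, src_to_pivot, pivot_to_src, top_k=2):
    """Build a simplified lattice: one slot of (token, weight) pairs per token.

    Tokens covered by the phrase table keep a single path. Unknown tokens
    gain up to `top_k` paraphrase alternatives next to the original token.
    """
    lattice = []
    for token in tokens:
        slot = [(token, 1.0)]
        if token not in known_vocab:
            candidates = pivot_paraphrases(src_to_pivot, pivot_to_src, token)
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            slot.extend(ranked[:top_k])
        lattice.append(slot)
    return lattice


# Toy tables with made-up probabilities, for illustration only.
SRC_TO_PIVOT = {"automobile": {"voiture": 0.8, "auto": 0.2}}
PIVOT_TO_SRC = {
    "voiture": {"car": 0.7, "automobile": 0.3},
    "auto": {"car": 0.5, "vehicle": 0.5},
}
KNOWN_VOCAB = {"the", "stopped", "car", "vehicle"}

lattice = build_lattice(
    ["the", "automobile", "stopped"], KNOWN_VOCAB, SRC_TO_PIVOT, PIVOT_TO_SRC
)
for slot in lattice:
    print(slot)
```

In this sketch only the out-of-vocabulary token receives alternatives, so the decoder would see extra paths exactly where the phrase table has no entry. A real system would operate on multi-word phrases and pass the paraphrase weights to the decoder as an additional feature.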
Keywords/Search Tags: statistical machine translation, domain adaptation, Chinese segmentation, paraphrase, lattice