
Domain Adaptation For Statistical Machine Translation

Posted on: 2015-12-15  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Cui  Full Text: PDF
GTID: 1108330479478729  Subject: Computer Science and Technology
Abstract/Summary:
Statistical Machine Translation (SMT) depends largely on the performance of translation modeling, which in turn relies on the data distribution. Many machine learning tasks assume that the data distributions of the training and testing domains are similar; however, this assumption does not hold for real-world SMT systems, so the models need to be adapted to the data distribution in order to optimize performance. Domain adaptation has recently become an active topic in SMT, aiming to alleviate the domain mismatch between training and testing data: with proper adaptation methods, models adapted to the data yield better results. In this thesis, we propose four new domain adaptation methods for SMT.

First, the quality of bilingual data is a key factor in SMT. Low-quality bilingual data tends to produce incorrect translation knowledge and degrades translation modeling performance. Previous work often filtered low-quality data with supervised learning methods, but these require a fair amount of human-labeled examples, which are not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method for cleaning bilingual data. The method leverages the mutual reinforcement between sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves performance on large-scale translation tasks.

Second, domain adaptation for SMT usually adapts models to a single specific domain and thus ignores the correlations among different domains, through which common knowledge could be shared to improve overall translation quality. We propose a novel multi-domain adaptation approach for SMT using Multi-Task Learning (MTL), with in-domain models tailored to each specific domain and a general-domain model shared across domains. The parameters of these models are tuned jointly via MTL so that general knowledge is learned more accurately and domain knowledge is exploited better. Our experiments on a large-scale translation task validate that the MTL-based adaptation approach significantly and consistently improves translation quality over a non-adapted baseline, and that it also outperforms individually adapting to each specific domain.

Third, MTL-based domain adaptation exploits contextual information to disambiguate translation candidates, but the context is often limited to the sentence boundary, so broader topical information cannot be leveraged. We therefore propose an approach that learns topic representations for parallel data using a neural network architecture, in which abundant topical contexts are embedded via topic-relevant monolingual data. Each translation rule is associated with a topic representation, and topic-relevant rules are selected during SMT decoding according to their distributional similarity with the source text. Experimental results show that our method significantly improves translation accuracy on the NIST translation task compared to a state-of-the-art baseline.
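To make the rule-selection step of the third approach concrete, here is a minimal sketch rather than the thesis's actual implementation: it assumes each translation rule already carries a precomputed topic vector and simply ranks candidate rules by cosine similarity with the topic vector inferred for the source text. All function names and the random toy data below are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two topic vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0.0 else 0.0

def select_topic_relevant_rules(source_topic, rules, top_k=5):
    """Rank candidate translation rules by the similarity of their topic
    representation to the topic representation of the source text.

    source_topic : np.ndarray, topic vector inferred for the input sentence
    rules        : list of (rule_id, np.ndarray) pairs, one topic vector per rule
    """
    scored = [(rule_id, cosine_similarity(source_topic, topic))
              for rule_id, topic in rules]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Toy usage with random 50-dimensional topic vectors.
rng = np.random.default_rng(0)
src_topic = rng.random(50)
candidate_rules = [(f"rule_{i}", rng.random(50)) for i in range(20)]
print(select_topic_relevant_rules(src_topic, candidate_rules, top_k=3))
```

In practice such a similarity score would typically be added as an extra feature in the decoder's log-linear model rather than used as a hard cut-off.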
Fourth, contemporary machine translation systems usually rely on offline data retrieved from the web to train individual models, such as translation models and language models. In contrast to existing methods, we propose a novel approach that treats machine translation as a web search task and uses the web on the fly to acquire translation knowledge. This end-to-end approach takes advantage of fresh web search results and the tremendous knowledge on the web to obtain phrase-level candidates on demand and then compose sentence-level translations. Experimental results show that our web-based machine translation method is very promising at leveraging fresh translation knowledge and making translation decisions. Furthermore, when combined with offline models, it significantly outperforms a state-of-the-art phrase-based statistical machine translation system.
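The following is only a toy sketch of the fourth idea, composing a sentence-level translation from phrase-level candidates acquired on demand. The fetch_phrase_candidates function is a hypothetical placeholder (here a hard-coded dictionary) for the step that would actually query a search engine and mine candidates from the returned snippets, and the greedy monotone composition is purely illustrative, not the thesis's decoding method.

```python
from typing import Dict, List

# Hypothetical stand-in for mining phrase translations from live web search
# results; a real system would issue queries and extract candidates from
# the returned snippets instead of looking them up in a fixed table.
TOY_WEB: Dict[str, List[str]] = {
    "machine translation": ["机器翻译"],
    "domain adaptation": ["领域自适应", "领域适应"],
}

def fetch_phrase_candidates(phrase: str) -> List[str]:
    """Return phrase-level translation candidates for a source phrase."""
    return TOY_WEB.get(phrase, [phrase])  # fall back to the source phrase itself

def compose_translation(source_phrases: List[str]) -> str:
    """Greedily pick the first candidate for each phrase and join them in
    source order (monotone composition, for illustration only)."""
    return " ".join(fetch_phrase_candidates(p)[0] for p in source_phrases)

print(compose_translation(["domain adaptation", "machine translation"]))
# -> 领域自适应 机器翻译
```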
The methods proposed in this thesis address several domain adaptation problems, including data cleaning for domain adaptation, multi-domain collaborative training and modeling, open-domain topic adjustment, and real-time acquisition of open-domain translation knowledge. They support large-scale SMT systems trained on big data and represent significant progress in this direction. Our research also provides new methodologies and perspectives for future work on domain adaptation.

Keywords/Search Tags: statistical machine translation, domain adaptation, multi-task learning, deep learning, web search