
Optimization On Translation Knowledge In Statistical Machine Translation

Posted on: 2015-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 2268330428498531
Subject: Computer application technology
Abstract/Summary:
The performance of statistical machine translation depends largely on large-scale bilingual training data, because the translation knowledge and language knowledge in the training data help in constructing the translation model and the language model. However, as the training corpus grows, the translation knowledge inevitably comes to contain redundant and noisy information, which has serious negative effects on the translation model and the language model. To this end, this thesis focuses on optimizing translation knowledge and proposes several novel approaches to the problem. The contributions of this work are summarized as follows:

Selection of Training Data
A classification-based selection approach is proposed to pick out high-quality bilingual sentences, using the quality of bilingual sentences as the evaluation criterion. Specifically, we first exploit several metrics to find the best and worst sentences in the corpus. Then we train a classifier with features extracted from the best and worst sentences. Finally, we use the classifier to automatically classify the remaining sentences. In this way, high-quality bilingual sentences can be automatically separated from low-quality ones. Experimental results show that the proposed approach outperforms the baseline system by 0.87 BLEU points.

Noise Filtering in Translation Knowledge
When hierarchical phrase-based statistical machine translation systems are used for spoken language translation, the content words of a translation are sometimes lost: source-side content words are translated into nothing on the target side during decoding. We propose a simple and efficient method for phrase-table filtering, which checks how a phrase's content words are translated to decide whether to use the phrase in decoding. Experimental results on spoken language translation show that our method alleviates the problem and improves translation performance at the same time.

Topic Information Integration in Translation Knowledge
We propose a topic-based reordering model using document-level information. Reordering examples are automatically learned from bilingual training data and are associated with document-level and word-level topic information induced by a topic model. We train a topic-based reordering model over the reordering examples. Finally, we integrate the reordering model into an SMT system. Experimental results on large-scale training data demonstrate the effectiveness of the proposed model.
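
As a rough illustration of the phrase-table filtering idea under "Noise Filtering in Translation Knowledge", the sketch below drops a phrase pair when any source-side content word has no aligned target word. It is not the thesis's implementation. The three-field entry layout, the `FUNCTION_WORDS` list, the English source side, and the helper names are all assumptions made for this example; a real system would likely identify content words with part-of-speech tags and read the decoder's own phrase-table format.

```python
# Hypothetical function-word list (assumption); a real system would likely use POS tags.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}


def is_content_word(token):
    """Treat any token that is neither a function word nor a nonterminal as a content word."""
    is_nonterminal = token.startswith("[") and token.endswith("]")
    return not is_nonterminal and token.lower() not in FUNCTION_WORDS


def keeps_content_words(entry):
    """Return True if every source content word is aligned to some target token.

    `entry` uses an assumed simplified layout: "source ||| target ||| alignment",
    where alignment is a space-separated list of "srcIndex-tgtIndex" links.
    """
    source, _target, alignment = (field.strip() for field in entry.split("|||"))
    aligned_source = {int(link.split("-")[0]) for link in alignment.split()}
    return all(
        index in aligned_source
        for index, token in enumerate(source.split())
        if is_content_word(token)
    )


def filter_phrase_table(entries):
    """Keep only the entries whose source content words are all translated."""
    return [entry for entry in entries if keeps_content_words(entry)]


# Toy entries: the second one loses the content word "flight" and is filtered out.
table = [
    "book a flight ||| reserver un vol ||| 0-0 1-1 2-2",
    "book a flight ||| reserver ||| 0-0",
    "the flight ||| vol ||| 1-0",
]
filtered = filter_phrase_table(table)
```

In this toy table the first and third entries are kept, because dropping an unaligned function word such as "the" is treated as harmless, while the second entry is removed because "flight" has no translation.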
Keywords/Search Tags: Statistical machine translation, Selection of training data, Phrase pair filtering, Document-level information