Optimization On Translation Knowledge In Statistical Machine Translation

Posted on:2015-03-16

Degree:Master

Type:Thesis

Country:China

Candidate:X Wang

Full Text:PDF

GTID:2268330428498531

Subject:Computer application technology

Abstract/Summary:


The performance of statistical machine translation (SMT) depends heavily on large-scale bilingual training data, because the translation knowledge and language knowledge in the training data contribute directly to building the translation model and the language model. However, as the training corpus grows, the translation knowledge inevitably comes to contain redundant and noisy information, which has a serious negative effect on both models. To this end, this thesis focuses on optimizing translation knowledge and proposes several novel approaches to the problem. The contributions of this work are summarized as follows:

Selection of Training Data

A classification-based selection approach is proposed to pick out high-quality bilingual sentences, using the quality of the bilingual sentences as the evaluation criterion. Specifically, we first use several metrics to find the best and worst sentences in the corpus. We then train a classifier on features extracted from these best and worst sentences. Finally, we use the classifier to automatically classify the remaining sentences. In this way, high-quality bilingual sentences can be automatically separated from low-quality ones. Experimental results show that the proposed approach outperforms the baseline system by 0.87 BLEU points.

Noise Filtering in Translation Knowledge

When hierarchical phrase-based SMT systems are used for spoken language translation, content words are sometimes lost in the translation: source-side content words are translated as empty on the target side during decoding. We propose a simple and efficient method for phrase-table filtering, in which the translation of a phrase's content words is checked to decide whether the phrase should be used in decoding. Experimental results on spoken language translation show that our method alleviates the problem and improves translation performance at the same time.

Topic Information Integration in Translation Knowledge

We propose a topic-based reordering model that uses document-level information. Reordering examples are automatically learned from bilingual training data and are associated with document-level and word-level topic information induced by a topic model. We train a topic-based reordering model on these reordering examples and finally integrate it into the SMT system. Experimental results on large-scale training data demonstrate the effectiveness of the proposed model.
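The abstract does not give the details of the phrase-table filter, so the following is only a minimal sketch of the general idea: drop a phrase pair when a source-side content word has no content-word translation on the target side. It assumes a Moses-style line layout (`source ||| target ||| scores ||| alignment`), and the function-word lists, the toy Spanish-English entries, and the names `is_content_word`, `keeps_content_words`, and `filter_phrase_table` are all hypothetical, not taken from the thesis.

```python
# Hypothetical function-word lists; a real system would more likely rely on
# part-of-speech tags or a much larger stop list.
SOURCE_FUNCTION_WORDS = {"la", "el", "de", "en"}
TARGET_FUNCTION_WORDS = {"the", "of", "in", "a"}


def is_content_word(token, function_words):
    """A token counts as a content word if it is neither a listed function
    word nor a nonterminal such as [X] in a hierarchical rule."""
    return token.lower() not in function_words and not token.startswith("[")


def keeps_content_words(line):
    """Return True if every source content word is aligned to at least one
    target content word (assumed layout: src ||| tgt ||| scores ||| align)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0].split(), fields[1].split()

    # Map each source position to the target positions it is aligned to.
    aligned = {}
    for pair in fields[3].split():
        src_pos, tgt_pos = pair.split("-")
        aligned.setdefault(int(src_pos), []).append(int(tgt_pos))

    for pos, token in enumerate(source):
        if not is_content_word(token, SOURCE_FUNCTION_WORDS):
            continue
        translations = [target[j] for j in aligned.get(pos, [])]
        if not any(is_content_word(t, TARGET_FUNCTION_WORDS) for t in translations):
            return False  # this content word would be lost in translation
    return True


def filter_phrase_table(lines):
    """Keep only the phrase pairs that preserve their content words."""
    return [line for line in lines if keeps_content_words(line)]


# Toy phrase table (invented entries, for illustration only).
table = [
    "la casa ||| the house ||| 0.5 ||| 0-0 1-1",
    "la casa ||| the ||| 0.1 ||| 0-0",
    "de la ||| of the ||| 0.4 ||| 0-0 1-1",
]
kept = filter_phrase_table(table)
```

In this toy table the second entry is dropped because "casa" is left unaligned, while the third is kept because it contains no content words to lose. Whether the thesis treats unaligned words, function-word-only phrases, or nonterminals this way is not stated in the abstract.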

Keywords/Search Tags:

Statistical machine translation, Selection of training data, Phrase pair filtering, Document-level information


Related items

1	Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation
2	Research On Statistical Machine Translation At Document Level
3	On Key Technologies For Phrase-Based Statistical Machine Translation
4	Research On Phrase-based Statistical Machine Translation
5	Phrase Alignment Models for Statistical Machine Translation
6	The Study On Phrase-Based Statistical Machine Translation System
7	Translation Knowledge Acquisition In Corpus-based Machine Translation
8	The Design And Realization Of A Phrase-based Statistical Chinese-English MTS
9	Research On Some Key Aspects Of Statistical Machine Translation
10	The Research And Application Of Phrase-Based Statistical Machine Translation System