Study On Technology Of Corpus Selection For Statistical Machine Translation

Posted on: 2014-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q Lu
GTID: 2268330425991545
Subject: Computer software and theory
Abstract/Summary:
The bilingual training corpus, as an indispensable source of knowledge for statistical machine translation, directly affects the translation performance of a system. In general, increasing the size of the bilingual training data leads to higher translation accuracy. However, as the corpus grows, the training and decoding cost of a machine translation system also increases. In addition, noise in the bilingual data can have a negative effect on translation accuracy.

In this thesis, we discuss the problem of data selection from two perspectives: data quality and coverage. The goal is to reduce the size of the training data while retaining the accuracy of the translation system, so that the cost of training and decoding can be decreased.

We propose multiple features for evaluating the quality of parallel sentence pairs, including the fluency of each sentence and the likelihood that the two sentences in a pair are translations of each other. We incorporate the features in a linear model and learn the feature weights on a labeled dataset using the Pranking algorithm. Experimental results show that the proposed approach can effectively distinguish between translation sentence pairs of differing quality, reaching an accuracy of 83.56%.

We propose a model for training data selection targeted at statistical machine translation. The model considers both the quality and the coverage of bilingual sentence pairs. Experimental results on the CWMT and NIST datasets show that when a machine translation system is trained on a selected 20% of the whole training dataset, its accuracy reaches 97% of the accuracy achieved with the whole training dataset. When the selected subset is increased to 30%, the resulting translation accuracy is comparable to, or even higher than, the accuracy achieved with the whole training data.

We also propose to improve the accuracy of a machine translation system by integrating the results of quality evaluation into the training process of the system. Experimental results show that this approach can improve translation accuracy, although the improvements achieved are marginal.

This thesis focuses mainly on corpus selection for statistical machine translation, including corpus scoring (quality evaluation of parallel data) and data selection for building machine translation systems. In the future, we will study other approaches to parallel corpus processing and will also consider the problem of translation model adaptation.
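The Pranking (PRank) algorithm mentioned in the abstract is an online, perceptron-style method for ordinal ranking: it learns one weight vector plus a set of ordered thresholds that cut the score axis into ranks. The following is only a minimal sketch of how feature weights for a linear quality model could be learned this way. It assumes that quality labels are integers 1..k and that each sentence pair has already been reduced to a numeric feature vector (for example a fluency score and a translation-likelihood score). The class name `PRank` and its methods are illustrative and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple


class PRank:
    """Perceptron ranking sketch for ordinal labels 1..n_ranks."""

    def __init__(self, n_features: int, n_ranks: int) -> None:
        self.w: List[float] = [0.0] * n_features
        # n_ranks - 1 finite thresholds; the last one is implicitly +infinity.
        self.b: List[float] = [0.0] * (n_ranks - 1)
        self.n_ranks = n_ranks

    def score(self, x: Sequence[float]) -> float:
        # Linear model: weighted sum of the quality features.
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x: Sequence[float]) -> int:
        # Predicted rank = smallest r whose threshold exceeds the score.
        s = self.score(x)
        for r, threshold in enumerate(self.b, start=1):
            if s - threshold < 0:
                return r
        return self.n_ranks

    def update(self, x: Sequence[float], y: int) -> None:
        # One online step: change the model only when the rank is wrong.
        if self.predict(x) == y:
            return
        s = self.score(x)
        taus = []
        for r, threshold in enumerate(self.b, start=1):
            # y_r says on which side of threshold r the score should fall.
            y_r = -1 if y <= r else 1
            taus.append(y_r if (s - threshold) * y_r <= 0 else 0)
        total = sum(taus)
        self.w = [wi + total * xi for wi, xi in zip(self.w, x)]
        self.b = [bi - t for bi, t in zip(self.b, taus)]

    def fit(self, data: Sequence[Tuple[Sequence[float], int]],
            epochs: int = 10) -> None:
        # Repeated passes over the labeled sentence-pair features.
        for _ in range(epochs):
            for x, y in data:
                self.update(x, y)
```

A model would be created with `PRank(n_features, n_ranks)`, trained with `fit` on `(features, label)` pairs, and then used to rank unseen sentence pairs with `predict`. The feature set, label scale, and number of epochs used in the thesis are not stated in the abstract, so those choices are left open here.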
Keywords/Search Tags: statistical machine translation, corpus selection, quality evaluation of sentence pairs, corpus coverage, the Pranking algorithm