Study On Technology Of Corpus Processing And Its Quality Evaluation For Statistical Machine Translation

Posted on: 2012-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: S J Yao
Full Text: PDF
GTID: 2248330395958252
Subject: Computer software and theory
Abstract/Summary:
In recent years, machine translation based on statistical methods has occupied the leading position, and a variety of statistical machine translation (SMT) systems have appeared, such as phrase-based, hierarchical phrase-based and syntax-based systems. Each system has its own characteristics, and each has shown good performance in different domains. For all of these SMT systems, the corpus is an indispensable resource. As the saying goes, one cannot make bricks without straw: an SMT system cannot translate without a corpus as its training data. This thesis therefore focuses on corpus pre-processing and on the evaluation, construction and optimization of bilingual parallel corpora for SMT. The automatic acquisition of parallel term resources is also briefly introduced.

Corpus pre-processing is an important task in SMT, though a tedious one. This thesis describes the pre-processing pipeline and the related techniques from a traditional viewpoint, with particular attention to patent corpora. It also briefly discusses some related problems and the influence of pre-processing on machine translation performance.

To address the low quality of bilingual parallel corpora and the need to construct high-quality training sets for SMT, we compared three evaluation methods. These methods, including one based on a bilingual dictionary, evaluate each sentence pair in terms of faithfulness and fluency. Experiments showed that they can effectively measure the quality of sentence pairs.

Bilingual terminology is also an important resource for machine translation and other fields of NLP, so we also introduce the acquisition and construction of bilingual terminology from the web and from academic literature databases. Using the automatic method, millions of bilingual term pairs have been collected for machine translation. These resources can also be used for Chinese word segmentation, information retrieval and so on.

We put forward an effective method for training data selection, based on sentence quality and on the coverage of the selected training set. Experimental results on the CWMT 2008 Chinese-to-English MT task show that our framework can effectively select a subset from a large training data set: even when trained on only the 20% of the data selected by our framework, the SMT system achieves performance comparable to the baseline system trained on all of the data. We also applied the method in practice for CWMT 2011 (the 7th China Workshop on Machine Translation), providing a bilingual corpus of one million entries to be used as the training set for its Chinese-to-English translation task. Related experiments also showed that the method is effective.

When the test set is given, it is necessary to optimize the training set. Previous research has studied such optimization. The main idea is to take the sentence pairs of the original training set that are regarded as more important, or more similar to the test set, add them to the original data, and use the result as the new training set in order to improve the performance of the trained SMT system. Based on this idea, we propose two different methods for selecting the sentences related to a given test set. Adding the selected sentence pairs to the original training set redistributes the weight of each training sentence pair. In our experiments, both methods achieved satisfactory results.

Based on the work described above, this thesis argues that domain, content similarity, the fluency and faithfulness of sentence pairs, coverage and other factors should be considered together when constructing or optimizing a training set for an SMT system. The method used should also be chosen according to the specific task.
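
The idea of selecting training data by sentence quality and coverage can be illustrated with a minimal sketch. This is not the thesis's implementation, which the abstract does not specify. The function names, the dictionary-match quality score (a rough proxy for faithfulness), the n-gram coverage gain, and the "quality times newly covered n-grams" ranking are all assumptions made for illustration, and sentences are assumed to be pre-tokenized with spaces.

```python
from typing import Dict, List, Set, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence), space-tokenized


def dictionary_quality(pair: SentencePair, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens that have a dictionary translation in the target.

    Hypothetical quality score: a crude stand-in for a dictionary-based
    faithfulness measure.
    """
    src_tokens = pair[0].split()
    tgt_tokens = set(pair[1].lower().split())
    if not src_tokens:
        return 0.0
    matched = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_tokens)
    return matched / len(src_tokens)


def ngrams(tokens: List[str], max_n: int = 2) -> Set[tuple]:
    """All n-grams of the token list for n = 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_training_subset(
    pairs: List[SentencePair],
    lexicon: Dict[str, Set[str]],
    budget: int,
    min_quality: float = 0.5,  # assumed threshold, not taken from the thesis
) -> List[SentencePair]:
    """Greedily pick up to `budget` pairs that are good and add new coverage."""
    # Step 1: drop pairs whose quality score is below the threshold.
    remaining = [
        (q, p)
        for q, p in ((dictionary_quality(p, lexicon), p) for p in pairs)
        if q >= min_quality
    ]
    covered: Set[tuple] = set()
    selected: List[SentencePair] = []
    # Step 2: repeatedly take the pair with the highest quality-weighted
    # number of source n-grams not yet covered by the selected subset.
    while remaining and len(selected) < budget:
        best_idx, best_score = -1, 0.0
        for i, (q, p) in enumerate(remaining):
            gain = len(ngrams(p[0].split()) - covered)
            score = q * gain
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx < 0:  # no remaining pair adds new coverage
            break
        _, best_pair = remaining.pop(best_idx)
        covered |= ngrams(best_pair[0].split())
        selected.append(best_pair)
    return selected
```

In this sketch a duplicate or near-duplicate sentence pair contributes no new n-grams once its twin has been selected, so it is skipped even if its quality score is high. That behavior is one plausible reason a small selected subset could stand in for the full training set.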
Keywords/Search Tags: statistical machine translation, construction of training set, selection of corpus, pre-processing, coverage, evaluation of sentence pair, resource of term translation