
Research on Automatic Selection of Bilingual Parallel Corpora for Statistical Machine Translation

Posted on: 2016-02-22  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Feng  Full Text: PDF
GTID: 2348330542486959  Subject: Computer software and theory
Abstract/Summary:
Statistical machine translation has become an active research topic in natural language processing. Translation performance depends not only on the decoding algorithm but also on the quality of the bilingual parallel corpus. With the rapid growth of computer networks, corpora have expanded in scale, which raises the cost of decoding, and their uneven quality introduces noisy data. To address these problems, this thesis aims to extract small, high-quality corpus subsets. First, the thesis takes sentence quality and coverage as the evaluation criteria. To evaluate sentence quality, it proposes methods for computing evaluation features for fluency and accuracy. Second, it proposes a linear model that combines all of the quality evaluation features, together with a method for learning the model's weight vector automatically. Experiments show that the proposed method distinguishes sentence quality effectively, with an accuracy of 84.92%. Furthermore, the thesis analyzes the effect of coverage on corpus selection and presents a phrase-based method for calculating coverage. Experimental results show that coverage has some effect on the results obtained from the training data, so it cannot be neglected in corpus selection. Finally, the thesis proposes a method for selecting small, high-quality training corpus subsets based on quality level and coverage ratio contribution. The results show that the method is reasonable and efficient; the effect of noisy data on statistical machine translation remains to be studied. To address the problem of how to select a small, high-quality training corpus, the thesis carries out a series of research tasks: computing corpus evaluation features, manually scoring the corpus, learning feature weights, comprehensive evaluation, computing coverage contribution values, and building the corpus selection model. Future work will continue to develop and improve the adaptability of the statistical machine translation model.
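The pipeline outlined in the abstract (a weighted combination of quality features, weights learned with a perceptron, and selection driven by phrase coverage) can be illustrated with a minimal Python sketch. The abstract does not give the actual feature definitions, coverage formula, or selection rule, so everything below is an assumption made for illustration: the function names `train_perceptron`, `quality_score`, `phrases` and `select_subset` are hypothetical, "phrases" are approximated by word n-grams, and the selection rule (filter by quality, then greedily maximize newly covered phrases) is only one plausible reading.

```python
from typing import List, Sequence, Set, Tuple


def train_perceptron(features: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     epochs: int = 20,
                     lr: float = 1.0) -> Tuple[List[float], float]:
    """Classic perceptron: learn weights w and bias b from labels in {+1, -1}."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, b


def quality_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear combination of a sentence pair's quality features."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def phrases(tokens: Sequence[str], max_len: int = 2) -> Set[Tuple[str, ...]]:
    """All word n-grams up to max_len (an assumed stand-in for phrases)."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)}


def select_subset(sentences: Sequence[Sequence[str]],
                  scores: Sequence[float],
                  budget: int,
                  min_quality: float = 0.0,
                  max_len: int = 2) -> List[int]:
    """Greedy selection: drop low-quality sentences, then repeatedly pick the
    sentence that adds the most uncovered phrases (ties broken by quality)."""
    covered: Set[Tuple[str, ...]] = set()
    remaining = [i for i, s in enumerate(scores) if s > min_quality]
    chosen: List[int] = []
    while remaining and len(chosen) < budget:
        best = max(remaining,
                   key=lambda i: (len(phrases(sentences[i], max_len) - covered),
                                  scores[i]))
        if not phrases(sentences[best], max_len) - covered:
            break  # no remaining sentence contributes new coverage
        chosen.append(best)
        covered |= phrases(sentences[best], max_len)
        remaining.remove(best)
    return chosen
```

In this sketch each sentence pair would be represented by a feature vector (for example, a fluency value and an accuracy value), manually scored pairs would supply the +1/-1 labels for `train_perceptron`, and `select_subset` would then be run over the scored corpus with a size budget.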
Keywords/Search Tags: bilingual parallel corpus selection, quality evaluation, linear model, perceptron algorithm, coverage ratio contribution factor, high-quality corpus selection model