Researches On Technologies Of Diglossia Parallel Corpus Selection Automation For Statistical Machine Translation

Posted on:2016-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Feng

Full Text:PDF

GTID:2348330542486959

Subject:Computer software and theory

Abstract/Summary:

Statistical machine translation has become a hot topic in natural language research. Translation performance depends not only on the decoding algorithm but also on the quality of the bilingual parallel corpus. With the rapid development of computer networks, corpora have grown in scale, but this growth also raises the cost of decoding, and the uneven quality of the collected data introduces noise. To address these problems, this thesis aims to extract small, high-quality subsets from a corpus. First, the thesis adopts sentence quality and coverage as the evaluation criteria, and proposes methods for computing the evaluation features that measure sentence fluency and accuracy. Second, it proposes a linear model that combines all of the quality evaluation features, together with a method for learning the model's weight vector automatically. Experiments show that the proposed method distinguishes sentence quality effectively, with an accuracy of 84.92%. Furthermore, the thesis analyzes the effect of coverage on corpus selection and presents a phrase-based method for calculating coverage. Experimental results show that coverage affects the results obtained from the training data, so it cannot be neglected in corpus selection. Finally, the thesis proposes a method for selecting small, high-quality training subsets based on quality level and coverage ratio contribution. The results demonstrate the soundness and efficiency of the method; the effect of noisy data on statistical machine translation remains to be studied. In summary, to address the question of how to select a small, high-quality training corpus, the thesis carries out a series of research tasks: computing corpus evaluation features, manually scoring the corpus, learning feature weights, comprehensive evaluation, computing coverage contribution values, and building the corpus selection model. Future work will continue to develop and improve the adaptability of the statistical machine translation model.
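
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so everything here is an assumption made for illustration. The sketch assumes each sentence pair already has a numeric feature vector (for example fluency and accuracy scores) and a manual label of +1 (good) or -1 (bad), learns the weights of a linear quality model with a standard perceptron (the algorithm named in the keywords), and approximates "coverage ratio contribution" by the number of not-yet-covered source-side n-grams a pair adds. The function names `train_perceptron`, `quality_score`, `phrases`, and `select_subset` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def quality_score(weights: Sequence[float], bias: float,
                  features: Sequence[float]) -> float:
    """Linear quality model: weighted sum of the evaluation features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def train_perceptron(samples: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     epochs: int = 20,
                     lr: float = 1.0) -> Tuple[List[float], float]:
    """Learn the weight vector from manually labeled pairs (+1 good, -1 bad)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Standard perceptron rule: update only on a misclassified sample.
            if y * quality_score(weights, bias, x) <= 0:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def phrases(sentence: str, max_n: int = 2) -> Set[Tuple[str, ...]]:
    """All word n-grams up to max_n; a stand-in for phrase-based coverage."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def select_subset(pairs: Sequence[Tuple[str, str]],
                  qualities: Sequence[float],
                  budget: int,
                  min_quality: float = 0.0,
                  max_n: int = 2) -> List[int]:
    """Greedy selection (assumed scheme): drop low-quality pairs, then
    repeatedly take the pair that adds the most uncovered source phrases,
    breaking ties by quality score. Returns indices into `pairs`."""
    covered: Set[Tuple[str, ...]] = set()
    candidates = [i for i, q in enumerate(qualities) if q > min_quality]
    chosen: List[int] = []
    while candidates and len(chosen) < budget:
        best = max(candidates,
                   key=lambda i: (len(phrases(pairs[i][0], max_n) - covered),
                                  qualities[i]))
        gain = phrases(pairs[best][0], max_n) - covered
        if not gain:  # no remaining pair contributes new coverage
            break
        covered |= gain
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

In use, `train_perceptron` would be run on the manually scored sample, `quality_score` applied to every pair in the full corpus, and the resulting scores passed to `select_subset` with a size budget. The greedy coverage step and the quality threshold are design choices of this sketch; the thesis may combine quality level and coverage contribution differently.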

Keywords/Search Tags:

diglossia parallel corpus selection automation, quality evaluation, linear model, perceptron algorithm, coverage ratio contribution factor, high quality corpus selection model

Related items

1	Study On Technology Of Corpus Selection For Statistical Machine Translation
2	Study On Technology Of Corpus Processing And Its Quality Evaluation For Statistical Machine Translation
3	Research On Acquiring Bilingual Parallel Sentences And Building Corpus
4	Research Into Testing Method Of The Large-scale Corpus Segmentation Quality
5	Research On Key Technologies Of Parallel Corpus Construction In Machine Translation Based On Pre-Training Model
6	Research And Application Of Chinese Word Segmentation Based On English-Chinese Parallel Corpus
7	Research On Named Entity Equivalents Automatic Acquisition Method Based On English-Chinese Parallel Corpus
8	Research Of Data Source Selection With Similar Theme In Deep Web Integrated System
9	Design And Implementation Of Automatic Construction System Of English-chinese Parallel Corpus
10	Research On Chinese Spoken Term Detection Technology For News Corpus