
Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation

Posted on: 2014-07-30; Degree: Doctor; Type: Dissertation
Country: China; Candidate: H C Liang
GTID: 1268330392472602; Subject: Computer application technology
Abstract/Summary:
In the 1990s, Brown et al. (IBM Watson Center) proposed a source-channel model for statistical machine translation, which significantly improved on traditional rule-based translation methods. Och et al. (RWTH Aachen University) released an open-source word alignment toolkit named GIZA++ based on Brown's work, after which statistical machine translation became a popular topic in natural language processing research.

Och attributed translation errors to four categories: Bayes errors, modeling errors, training errors, and search errors during decoding. This thesis starts from the model training process. By analysing the errors that can arise while training a statistical machine translation system, we explore effective methods for handling them. Specifically, the thesis focuses on the following aspects:

(1) Improvements on word alignment: the aligned sentence pairs in a training corpus for statistical machine translation are mostly extracted from corpora aligned at the document level, so the training corpus contains a large number of wrong alignments. We propose a perplexity-based method to filter incorrect sentence pairs from the training corpus. To fix the problem caused by mono-directional alignments of low-frequency words, we also present a discriminative word alignment algorithm based on the features of IBM Model 4.

(2) Improvements on phrase extraction: in order to get more phrase-based translation rules from the training corpus, we propose non-strict phrase extraction and another method that performs phrase extraction on multiple word alignment results. Both methods yield more translation rules from the training corpus, but also more errors. We explore an effective method to filter these rules, so that most of the incorrect rules are eliminated without badly affecting translation quality.

(3) Improvements on reordering models: the translation rule set extracted from the training corpus is too sparse, and the statistics for certain reordering phenomena may not be sufficient. We propose a syntax-based reordering model. Since the number of POS tags and syntactic tags is much smaller than the number of words, reordering rules based on these tags are well covered in the training corpus, so a reordering model based on syntactic tags will be more accurate.

(4) Improvements on tuning the model feature weights: the most popular parameter tuning approach in statistical machine translation is minimum error rate training. We propose a forced decoding feature for the decoding stage of minimum error rate training. The n-best translations produced by forced decoding are more similar to the references in the development set. In the tuning stage of minimum error rate training, these n-best translations can prevent the training process from converging to a bad local optimum.
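
As a rough illustration of the perplexity-based sentence pair filtering mentioned in item (1), the following Python sketch scores each sentence pair and drops pairs whose perplexity exceeds a threshold. The abstract does not say which model the thesis uses to compute perplexity, so this sketch assumes an IBM Model 1 style lexical translation table; the names `pair_perplexity`, `filter_pairs`, `t_table`, the toy probabilities, and the threshold are all hypothetical and not taken from the thesis.

```python
import math

def pair_perplexity(src_tokens, tgt_tokens, t_table, floor=1e-6):
    """Per-word perplexity of the target sentence given the source sentence.

    Assumption: an IBM Model 1 style score, where each target word is
    explained by a uniform mixture over the source words plus a NULL word.
    t_table maps (target_word, source_word) -> translation probability.
    """
    src = ["<NULL>"] + list(src_tokens)
    log_prob = 0.0
    for f in tgt_tokens:
        p = sum(t_table.get((f, e), 0.0) for e in src) / len(src)
        # Floor the probability so unseen words do not produce log(0).
        log_prob += math.log(max(p, floor))
    return math.exp(-log_prob / max(len(tgt_tokens), 1))

def filter_pairs(pairs, t_table, threshold):
    """Keep only the sentence pairs whose perplexity is at most threshold."""
    return [(s, t) for s, t in pairs
            if pair_perplexity(s, t, t_table) <= threshold]

# Toy lexical table (hypothetical values, for illustration only).
t_table = {
    ("la", "the"): 0.8,
    ("maison", "house"): 0.9,
}

pairs = [
    (["the", "house"], ["la", "maison"]),   # plausible translation
    (["the", "house"], ["chien", "noir"]),  # misaligned pair
]

kept = filter_pairs(pairs, t_table, threshold=50.0)
```

In this toy setting the misaligned pair receives a very high perplexity because none of its target words can be explained by the source words, so only the first pair is kept. A real system would estimate the table from data and choose the threshold empirically.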
Keywords/Search Tags: statistical machine translation, word alignment, phrase extraction, reordering model, minimum error rate training