Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation

Posted on:2014-07-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H C Liang

Full Text:PDF

GTID:1268330392472602

Subject:Computer application technology

Abstract/Summary:

In the 1990s, Brown et al. (IBM Watson Center) proposed a source-channel model for statistical machine translation, which significantly improved on traditional rule-based translation methods. Och et al. (RWTH Aachen University) released an open-source word alignment toolkit named GIZA++, based on Brown's work, after which statistical machine translation became a popular topic in natural language processing research.

Och attributed translation errors to four categories: Bayes errors, modeling errors, training errors, and search errors during decoding. This thesis starts from the model training process. By analysing the errors that may arise in the training process of statistical machine translation, we explore effective methods to handle them. More specifically, this thesis focuses on the following aspects:

(1) Improvements on word alignment: the aligned sentence pairs in a training corpus for statistical machine translation are mostly extracted from corpora aligned at the document level, so the training corpus contains a large number of wrongly aligned sentence pairs. We propose a perplexity-based method to filter the incorrect sentence pairs out of the training corpus. To fix the problem caused by mono-directional alignments on low-frequency words, we also present a discriminative word alignment algorithm based on the features of IBM Model 4.

(2) Improvements on phrase extraction: in order to obtain more phrase-based translation rules from the training corpus, we propose non-strict phrase extraction, as well as a second method that performs phrase extraction on multiple word alignment results. Both methods yield more translation rules from the training corpus, but also more errors. We explore an effective method to filter these rules, so that most of the incorrect rules are eliminated without badly affecting translation quality.

(3) Improvements on reordering models: the translation rule set extracted from the training corpus is too sparse, so the statistics for certain reordering phenomena may be insufficient. We propose a syntax-based reordering model. Since the number of POS tags and syntactic tags is much smaller than the number of words, reordering rules based on these tags are well covered in the training corpus, and a reordering model based on syntactic tags is therefore expected to be more accurate.

(4) Improvements on tuning the model feature weights: the most popular parameter tuning approach in statistical machine translation is minimum error rate training. We propose a forced decoding feature for the decoding stage of minimum error rate training. The n-best translations produced by forced decoding are more similar to the references in the development set. In the tuning stage of minimum error rate training, these n-best translations can prevent the training process from converging to a bad local optimum.
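As an illustration of the perplexity-based filtering idea in aspect (1), the following is a minimal sketch and not the thesis's actual method, whose details the abstract does not give. It assumes an IBM Model 1-style lexical translation table and a fixed perplexity threshold; the names `pair_perplexity`, `filter_pairs`, `t_prob`, `floor` and `max_perplexity` are hypothetical.

```python
import math

def pair_perplexity(src_tokens, tgt_tokens, t_prob, floor=1e-6):
    """Per-word perplexity of the target sentence given the source sentence.

    Assumption: t_prob maps (source_word, target_word) to a lexical
    translation probability, as in an IBM Model 1-style table.
    """
    if not src_tokens or not tgt_tokens:
        return float("inf")
    src = ["<NULL>"] + list(src_tokens)  # allow alignment to an empty word
    log_prob = 0.0
    for t in tgt_tokens:
        # Uniform alignment: average the lexical probability over source positions.
        p = sum(t_prob.get((s, t), 0.0) for s in src) / len(src)
        log_prob += math.log(max(p, floor))  # floor avoids log(0) for unseen pairs
    return math.exp(-log_prob / len(tgt_tokens))

def filter_pairs(pairs, t_prob, max_perplexity):
    """Keep only sentence pairs whose perplexity does not exceed the threshold."""
    return [(s, t) for s, t in pairs
            if pair_perplexity(s, t, t_prob) <= max_perplexity]

# Toy data (hypothetical): the second pair is a misaligned sentence pair.
t_prob = {("le", "the"): 0.8, ("chat", "cat"): 0.9}
pairs = [
    (["le", "chat"], ["the", "cat"]),
    (["le", "chat"], ["stock", "prices"]),
]
kept = filter_pairs(pairs, t_prob, max_perplexity=10.0)
```

In this toy setting the misaligned pair receives a very high perplexity because none of its target words is supported by the lexical table, so it falls above the threshold. In practice the threshold would have to be chosen empirically.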

Keywords/Search Tags:

statistical machine translation, word alignment, phrase extraction, reordering model, minimum error rate training

Related items

1	Research On Multi-group Parameter Tuning And Decoding In Statistical Machine Translation
2	The Study On Phrase-Based Statistical Machine Translation System
3	Study On Word Alignment Technology And Construction Of Statistical Machine Translation Platform
4	Phrase Alignment Models for Statistical Machine Translation
5	Training Large-Scale Statistical Machine Translation Models On Spark
6	Rule-based And Statistical-based Combination Of Bilingual Parallel Sentence, The Phrase Alignment Method
7	Research On Word Alignment In Statistical Machine Translation
8	The Research Of Phrase Extraction Technology For Tibetan And Chinese Statistical Machine Translation
9	On Key Technologies For Phrase-Based Statistical Machine Translation
10	Morphology-Processing In Chinese-Mongolian Statistical Machine Translation