On Key Technologies For Phrase-Based Statistical Machine Translation

Posted on: 2014-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
Full Text: PDF
GTID: 2308330473953776
Subject: Computer application technology
Abstract/Summary:
Machine translation (MT) is a technology that people have desired for hundreds of years. Ever since the computer was born in 1946, computer scientists and linguists have dreamed of using computers to translate between languages without human intervention. Over the past two decades, statistical models have been extensively investigated for MT, and statistical MT (SMT) systems now achieve state-of-the-art performance on many language pairs compared with other approaches to machine translation. Among all the SMT models, the phrase-based model is the simplest and the most effective. In this thesis, we improve several key components of a state-of-the-art phrase-based SMT system.

For phrase-based SMT, the phrase translation table is one of the core components and is intended to solve the "word selection" problem. Building a phrase translation table currently follows a standard paradigm: a typical approach heuristically extracts all possible phrases that are consistent with the word alignment. However, a straightforward implementation of this approach tends to produce an excessive number of phrases when phrases of arbitrary length may be extracted. This thesis presents a new phrase extraction approach, referred to as the composing-based phrase extraction method, which recursively composes minimal phrases to learn a compact phrase table. Experimental results on Chinese-to-English translation show that the 2-composed method achieves translation performance comparable to the typical phrase extraction method while reducing the size of the phrase table by 44.3%.

Another important SMT component is the decoder, which translates a source-language sentence into its best target-language counterpart using various resources, including the translation model, the reordering model, and the language model. Based on an analysis of the CYK algorithm for decoding, we present an optimized cube pruning method that greatly reduces time and space complexity and improves translation speed while keeping translation performance comparable to the baseline.

When analyzing the translation results, we further find that many notional words are deleted within the statistical translation framework. In this thesis, we add new features to the log-linear model to alleviate this problem.

After translation, the raw output needs to be recased and detokenized, a step we call post-processing. In this thesis, we present a new recasing method for English sentences that can be easily implemented in a left-to-right fashion and generates high-quality recasing results.

In summary, this thesis discusses the key techniques for phrase-based statistical machine translation, including the translation model, the decoder, and the post-processing module, and proposes efficient optimization techniques for them.
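
The standard alignment-consistent phrase extraction that the abstract takes as its baseline can be illustrated with a minimal Python sketch. This is not the thesis's composing-based method, whose details the abstract does not give. The function name `extract_phrases`, its parameters, and the toy alignment are assumptions made for illustration, and the sketch deliberately omits the usual extension of phrases over unaligned target words.

```python
from typing import Set, Tuple

Span = Tuple[int, int]            # inclusive (start, end) word indices
Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def extract_phrases(src_len: int, alignment: Alignment,
                    max_len: int) -> Set[Tuple[Span, Span]]:
    """Return all (source span, target span) pairs consistent with the alignment.

    Hypothetical helper for illustration only. A pair is consistent when no
    word inside the target span is aligned to a word outside the source span.
    Simplification: target spans are the tightest span covering the links,
    with no extension over unaligned target words.
    """
    phrases: Set[Tuple[Span, Span]] = set()
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the pair if a target word in the span links outside it.
            violates = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if not violates:
                phrases.add(((s_start, s_end), (t_start, t_end)))
    return phrases


# Toy alignment (assumed): source word 0 -> target 0, 1 -> 2, 2 -> 1.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
for limit in (1, 2, 3):
    print(limit, len(extract_phrases(3, toy_alignment, limit)))
```

Even on this three-word example the number of extracted pairs grows as the length limit is raised, which is the table growth that motivates learning a more compact phrase table.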
Keywords/Search Tags: statistical machine translation, phrase-based statistical machine translation, phrase extraction, decoder, post-processing