
Machine learning approaches for dealing with limited bilingual training data in statistical machine translation

Posted on: 2010-02-20    Degree: Ph.D    Type: Thesis
University: Simon Fraser University (Canada)    Candidate: Haffari, Gholamreza    Full Text: PDF
GTID: 2448390002977212    Subject: Computer Science
Abstract/Summary:
Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus: sentences aligned with their human-produced translations. However, high-quality translation output depends on the availability of massive amounts of parallel text in the source and target languages. A large number of languages are considered low-density, either because the population speaking the language is small, or because, even though millions of people speak the language, insufficient online resources are available in it. This thesis covers machine learning approaches for dealing with such situations in statistical machine translation, where the amount of available bilingual data is limited.

This dissertation provides two approaches to this problem, unified in what is called the bootstrapping framework. I assume we are given access to a monolingual corpus containing a large number of sentences in the source language, in addition to a small or moderately sized bilingual corpus. The idea is to take advantage of this readily available monolingual data to build a better SMT model iteratively: select an important subset of these monolingual sentences, prepare their translations, and use them together with the original sentence pairs to re-train the SMT model. If a human annotator prepares the translations of the selected sentences, the framework fits the active learning scenario in machine learning. If instead we use the translations generated by the SMT system itself, we get the self-training framework, which fits the semi-supervised learning scenario. The key points I address throughout this thesis are (1) how to choose the important sentences, (2) how to provide their translations with as little effort as possible, and (3) how to use the newly collected information in training the SMT model.
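The select-translate-retrain loop described above can be sketched as follows. This is a toy illustration only: the word-for-word "model", the confidence-based selection heuristic, and all function names are hypothetical stand-ins, not the thesis's actual phrase-based system. The `oracle` argument switches between the two variants the abstract names: a human translator (active learning) or the model's own output (self-training).

```python
def train_smt(bilingual):
    """Toy stand-in for SMT training: a word-for-word translation table."""
    table = {}
    for src, tgt in bilingual:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return table

def confidence(model, sentence):
    """Fraction of source words the model can translate."""
    words = sentence.split()
    return sum(w in model for w in words) / len(words)

def translate(model, sentence):
    """Word-by-word translation; unknown words pass through unchanged."""
    return " ".join(model.get(w, w) for w in sentence.split())

def bootstrap(bilingual, monolingual, rounds=2, batch=1, oracle=None):
    """Iteratively grow the bilingual corpus from monolingual source text.

    With `oracle` (a human annotator) this is the active-learning variant;
    without it, the model translates for itself (self-training).
    """
    model = train_smt(bilingual)
    pool = list(monolingual)
    for _ in range(rounds):
        if not pool:
            break
        # (1) choose the "important" sentences -- here, the ones the
        #     current model is least confident about
        pool.sort(key=lambda s: confidence(model, s))
        selected, pool = pool[:batch], pool[batch:]
        # (2) obtain their translations: human oracle or the model itself
        pairs = [(s, oracle(s) if oracle else translate(model, s))
                 for s in selected]
        # (3) re-train on the enlarged sentence-pair collection
        bilingual = bilingual + pairs
        model = train_smt(bilingual)
    return model
```

In the active-learning variant the loop spends annotator effort only on the sentences selected in step (1), which is why the choice of selection criterion matters; the self-training variant is fully automatic but can only reinforce what the model already knows.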
As a result, we have a fully automatic and general method for improving phrase-based SMT models when the amount of bilingual training data is small.

The success of self-training in SMT and many other NLP problems raises the question of why self-training works. I investigate this question by giving a theoretical analysis of self-training for decision lists. I provide objective functions, motivated by information theory, for the resulting semi-supervised learning algorithms. These objective functions give us (1) insights into why and when we should expect self-training to work well, and (2) proofs of convergence for the corresponding algorithms.

The problem of learning from insufficient labeled training data has been addressed in the machine learning community under two general frameworks: (i) semi-supervised learning and (ii) active learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed for these two learning scenarios. In this thesis, I develop both semi-supervised learning and active learning algorithms to deal with the shortage of bilingual training data in statistical machine translation.
Keywords/Search Tags: Machine, Bilingual training data, SMT, Sentences, Approaches, Semi-supervised learning