Complementing Performance Of Statistical Machine Translation For Less-Resourced Language Pairs

Posted on: 2016-07-29 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Abraham Tesso Nedjo
GTID: 1318330482967196 | Subject: Computer Science and Technology
Abstract/Summary:
Over the last five decades, in an era of information and communication technology, many kinds of machine translation systems have been developed for languages that are widely used nationally or internationally. The Oromo language has not benefited from such systems, even though it is the working language of the State Government of Oromia and one of the major languages of Ethiopia and Africa. No machine translation system of any sort has been seen for the Oromo language so far. This study is therefore an attempt to develop a simple Oromo-English machine translation system, the first of its kind in the history of the Oromo language. The researcher's guiding motto in facing these challenges is: if someone does not do it, no one will do it. And I wanted to be that someone. If I have to die, I am comforted by rendering this timely tool in the history of the Oromo language as a professional legacy.
This dissertation demonstrates that a machine translation system can achieve promising performance, up to state-of-the-art accuracy, using statistical machine translation even for less-resourced language pairs, by deploying linguistic annotation of the texts and careful pre-processing. We discuss text tokenization, part-of-speech tagging, and the deployment of this information on a small parallel corpus under the phrase-based statistical machine translation framework.
In addition to morphological, inflectional, and word-order problems, the Oromo language has another difficulty that creates data sparsity: variation in the symbols used to represent hudhaa in Oromo texts. Hudhaa is the diacritical marker, or glottal symbol, in Oromo. Text tokenization concepts were therefore reviewed, and an appropriate tokenizer for Oromo text was developed.
A thorough analysis was made of the orthographic behavior of the language to tackle the challenge of the intra-word glottal character, hudhaa. The approach used in this research handles this diacritical marker in a uniform manner; without it, the tokenizer would produce wrong tokens. This uniform marking of hudhaa also reduced data sparsity and increased the translation probability of sentences containing words with hudhaa in the Oromo-English machine translation system developed in this research.
The second focus of the thesis is part-of-speech (POS) tagging. Parts of speech are linguistic categories: groups of words with similar syntactic features, such as noun, adjective, verb, and adverb. In the thesis, we investigated different methods for learning POS tags. We used a state-of-the-art technique, the Maximum Entropy Markov Model, to develop an automatic part-of-speech tagger for the Oromo language. This method let us add rules to the algorithm as feature functions and, as a result, produced a good model.
Thirdly, this thesis presents our study of exploiting the languages' word-class information, augmented with some rule-based processing, for phrase-based Statistical Machine Translation (SMT). In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult because of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Oromo, this problem can be more severe. In addition, the use of different symbols for hudhaa (the diacritical marker) in Oromo introduces another severe data sparsity problem. In this work, we use fine tokenization that accounts for the intra-word behavior of words containing hudhaa, together with POS tags, to modify the Oromo input sentence, and we show how this improves the Oromo-English machine translation system.
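The uniform marking of hudhaa described above can be illustrated with a minimal sketch. This is not the tokenizer developed in the dissertation: the set of variant characters, the choice of the plain apostrophe as the canonical form, and the names `HUDHAA_VARIANTS`, `normalize_hudhaa`, and `tokenize` are all assumptions made for illustration. The idea is only that every intra-word variant is mapped to one symbol, so that spellings of the same word collapse to a single token type and the word is not split at the marker.

```python
import re

# Assumed variants of the hudhaa symbol (curly quotes, modifier apostrophe,
# backtick, acute accent). This list is illustrative, not from the dissertation.
HUDHAA_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"
CANONICAL = "'"  # assumed canonical form

# Match a hudhaa-like character only when it sits between two word characters,
# so quotation marks around a word are left alone.
_VARIANT_RE = re.compile(r"(?<=\w)[" + HUDHAA_VARIANTS + CANONICAL + r"](?=\w)")

# A token is a run of word characters, optionally joined by intra-word
# hudhaa, or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def normalize_hudhaa(text):
    """Replace every intra-word hudhaa variant with the canonical symbol."""
    return _VARIANT_RE.sub(CANONICAL, text)


def tokenize(text):
    """Normalize hudhaa, then split into words and punctuation."""
    return _TOKEN_RE.findall(normalize_hudhaa(text))


if __name__ == "__main__":
    print(tokenize("Har\u2019a bu`aa argate."))
```

Under these assumptions, four spellings of the same word that differ only in the hudhaa symbol count as one vocabulary entry instead of four, which is the sparsity reduction the abstract refers to.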
The models were trained on a very small parallel corpus (one that would usually be considered unacceptable for a normal SMT system), and the quality of the parallel corpus was also poor, with both translation and spelling errors. Yet our final system achieves a BLEU score of 3.11, compared with 2.78 for the baseline system. The translations were also evaluated using human evaluation on the parameters of adequacy and fluency. The final system achieved geometric averages of 3.45 and 3.48, respectively, out of a maximum of 5 points, whereas the baseline system achieved only 3.36 and 3.39.
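The arithmetic behind these comparisons can be sketched as follows. The BLEU scores are the ones reported above; the list of per-sentence ratings is made up purely to show how a geometric average of 1-to-5 scores might be computed, since the abstract does not describe the raw ratings or the exact averaging procedure. The helper name `relative_gain` is likewise hypothetical.

```python
from statistics import geometric_mean  # available in Python 3.8+


def relative_gain(baseline, system):
    """Relative improvement of `system` over `baseline`."""
    return (system - baseline) / baseline


# BLEU scores reported in the abstract.
bleu_gain = relative_gain(2.78, 3.11)
print(f"relative BLEU gain: {bleu_gain:.1%}")

# Hypothetical adequacy ratings on a 1-5 scale (not data from the dissertation).
hypothetical_adequacy = [3, 4, 4, 3, 4]
print(f"geometric mean: {geometric_mean(hypothetical_adequacy):.2f}")
```

On the reported figures, the final system's BLEU score is roughly 12% higher than the baseline in relative terms, although both absolute scores are low.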
Keywords/Search Tags: Oromo-English MT, Oromo Tokenization, Oromo POS tagger