Research on Chinese-Mongolian Statistical Machine Translation Method for Limited Domain

Posted on: 2018-08-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z X Yang
GTID: 1318330512485568
Subject: Control Science and Engineering
Abstract/Summary:
In recent years, research on Machine Translation (MT) has received great attention, and MT performance has improved considerably. Mongolian is one of China's minority languages, and research on Chinese-Mongolian Statistical Machine Translation (SMT) is widely valued by the academic community. However, Chinese-Mongolian SMT faces many challenges, such as data sparsity and differences in morphology and word order. Building a parallel corpus by hand is time-consuming and laborious, so the corpus is difficult to expand to a large scale in the short term. It is therefore of both theoretical importance and practical value to study how translation quality can be improved through better methods. This dissertation addresses the challenges of Chinese-Mongolian SMT in a low-resource setting with five key technologies: Mongolian morphological segmentation with large unlabeled data, morpheme-based weighting, a synonym-based reordering model, a translation method that uses morphemes as an intermediary, and system combination. The main work and contributions of the dissertation are summarized as follows:

1. The dissertation proposes a Mongolian morphological segmentation method that uses both labeled data and large unlabeled data to address morphological asymmetry. Mongolian is a morphologically rich language while Chinese is an isolating language, and the morphological differences pose great challenges to machine translation. The dissertation presents a novel segmentation method aimed at a practical application, namely SMT. First, CRF-based supervised learning is used to predict morpheme boundaries from labeled data. Then, a lexicon-based segmentation model, which uses the labeled data as heuristic information, exploits the abundant unlabeled data to compensate for the weaknesses of the first step. Finally, an error correction method revises the segmentation results.

2. The dissertation presents morpheme-based weighting as a smoothing method for phrase translation probability. Data sparsity leads to inadequately trained translation probabilities, so the resulting probabilities do not sufficiently reflect the reliability of the translation between a source phrase and a target phrase. The dissertation proposes a morpheme-based weighting that decomposes a Mongolian word sequence into a morpheme sequence and thereby obtains a better estimate of the phrase translation probability. This method estimates the translation reliability between source and target phrases more reasonably. In addition, the dissertation integrates the morpheme-based weighting into the baseline system using three integration methods and improves translation performance.

3. The dissertation presents a synonym-based reordering model to address the word order difference between Chinese and Mongolian. The reordering model is a crucial component of Chinese-Mongolian SMT because of data sparsity and word order difference. The key idea of the synonym-based reordering model is that synonymous phrases can share the same reordering instances, so more data is available for estimating the probabilities. The dissertation then integrates the synonym-based reordering model into the baseline SMT system as additional feature functions to generate more fluent translations.

4. The dissertation proposes a translation method that uses morphemes as an intermediary to construct new translation knowledge. Because of data sparsity, the phrase translation knowledge extracted from the parallel corpus is small in scale, which seriously restricts SMT performance. The dissertation treats Mongolian morphemes as a pivot language and constructs two new SMT systems: a Chinese-Morpheme SMT system and a Morpheme-Mongolian SMT system. New translation knowledge, including a phrase translation table and a reordering model, is induced via these two systems. In addition, the dissertation uses multiple decoding paths and multiple feature functions to incorporate the new translation knowledge into the baseline system.

5. The dissertation uses system combination as a unifying framework for the proposed methods. Morpheme-based weighting, the synonym-based reordering model, and the morpheme-intermediary method address unreliable phrase translation probabilities, word order difference, and small-scale translation knowledge, respectively. An appropriate framework is needed to combine these methods and further improve translation quality. The dissertation uses word-level system combination as this framework, with the TER metric used for hypothesis alignment. The experimental results show that system combination further improves translation quality significantly.

The training data used in the dissertation includes 67,288 daily-language sentence pairs, 220 thousand bilingual dictionary entries, and 500 agricultural bilingual sentence pairs. The test sets for the daily-language and agricultural domains contain 500 and 200 sentence pairs, respectively. The dissertation obtains gains of 2.16 BLEU points in the daily-language domain and 3.36 BLEU points in the agricultural domain.
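To make the pivot idea in contribution 4 concrete, the sketch below shows phrase-table triangulation, a standard pivot technique in SMT in which a source-to-target probability is obtained by summing over shared pivot phrases: p(t|s) = Σ_m p(t|m)·p(m|s). This is only an illustration under stated assumptions; the abstract does not say how the dissertation induces its new phrase table, and the function name `triangulate`, the dictionary layout, and the toy entries are all hypothetical.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Induce p(tgt | src) by marginalising over shared pivot phrases.

    src_to_pivot: {source phrase: {pivot phrase: p(pivot | source)}}
    pivot_to_tgt: {pivot phrase:  {target phrase: p(target | pivot)}}
    Both layouts are hypothetical, chosen only for this illustration.
    """
    induced = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                induced[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in induced.items()}


# Toy placeholder entries (not real Chinese, morpheme, or Mongolian data).
chinese_to_morpheme = {"zh_a": {"m1 m2": 0.6, "m3": 0.4}}
morpheme_to_mongolian = {
    "m1 m2": {"mn_x": 1.0},
    "m3": {"mn_x": 0.5, "mn_y": 0.5},
}

table = triangulate(chinese_to_morpheme, morpheme_to_mongolian)
for target, prob in sorted(table["zh_a"].items()):
    print(f"zh_a -> {target}: {prob:.2f}")
```

In a full system, entries induced this way would typically be added alongside the directly extracted phrase table, for example as an extra decoding path with its own feature weights, rather than replacing it.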
Keywords/Search Tags: Limited Domain Machine Translation, Morphological Segmentation, Reordering Model, Translation Model, System Combination