Research On Nonparametric Bayesian Based Multi-language Names Transliteration

Posted on: 2014-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: T T Li
Full Text: PDF
GTID: 2268330422950587
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the information age, the internet holds a large amount of information about culture, technology, life and entertainment. To help users obtain and understand this multilingual information, natural language processing technologies such as machine translation, cross-language information retrieval and cross-language information extraction are urgently needed. Transliteration is an important basic technology for machine translation, cross-language question answering, and cross-language information retrieval and extraction. Most named entities and out-of-vocabulary words are transliterated from other languages, and name transliteration is the main part of transliteration. In our work, we focus on statistical name transliteration and combine it with rule-based methods to realize multi-language name transliteration.

Machine transliteration methods are mainly either rule-based or statistical. Rule-based methods perform transliteration using manually created rules for syllable segmentation and alignment. Statistical methods build alignment and decoding models by learning from large-scale bilingual corpora, and they are the mainstream approach to machine transliteration. In this work, we focus on statistical methods. We combine our statistical method with rule-based methods to build machine transliteration models from English, Spanish, Russian and Japanese to Chinese, and we build an online machine transliteration system for the four models. The main work and contributions are as follows:

(1) First, because transliteration alignment suffers from over-fitting and from the name-origin problem, we propose a new model, the coupled Dirichlet process mixture model (cDPMM), to deal with both. In cDPMM, a Dirichlet process is used to realize bilingual segmentation, and a Dirichlet process mixture model is used to cluster name pairs by the names' origin (spelling similarity). cDPMM tightly couples bilingual segmentation and clustering: it handles alignment and clustering in a unified model and overcomes the over-fitting and name-origin problems simultaneously.

(2) Second, we use the decoder of the phrase-based Moses system for transliteration decoding, and we add three features to the phrase table: origin distinction degree, ratio of bilingual characters, and ratio of bilingual syllables. We re-rank the N-best results of the transliteration decoder with a log-linear model based on N-gram features, the ratio of bilingual syllables, and the decoder ranking.

(3) Third, we use our statistical method to build English-Chinese and Spanish-Chinese transliteration models, and we combine it with rule-based methods to overcome the lack of parallel corpora when building Russian-Chinese and Japanese-Chinese transliteration models.

(4) Finally, we build an online system for the four transliteration models.
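The N-best re-ranking step in contribution (2) can be illustrated with a minimal sketch. The abstract names only the feature families (N-gram features, ratio of bilingual syllables, decoder ranking). Everything else here is an assumption made for illustration and not the thesis's actual implementation: the `Candidate` fields, the way each feature is turned into a number, the weight values, and the toy N-best list.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str              # candidate transliteration (placeholder label here)
    ngram_logprob: float   # log-probability under a target-side n-gram model (assumed feature)
    syllable_ratio: float  # target syllables / source syllables (assumed definition)
    decoder_rank: int      # 1-based position in the decoder's N-best list


# Hypothetical weights; in practice they would be tuned on held-out data.
WEIGHTS = {"ngram": 1.0, "ratio": 2.0, "rank": 0.5}


def score(c: Candidate, weights: dict = WEIGHTS) -> float:
    """Log-linear score: a weighted sum of feature values."""
    ratio_feature = -abs(1.0 - c.syllable_ratio)  # penalise length mismatch
    rank_feature = 1.0 / c.decoder_rank           # favour the decoder's top choices
    return (weights["ngram"] * c.ngram_logprob
            + weights["ratio"] * ratio_feature
            + weights["rank"] * rank_feature)


def rerank(candidates: list, weights: dict = WEIGHTS) -> list:
    """Return the candidates sorted by descending log-linear score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Toy N-best list with made-up feature values.
nbest = [
    Candidate("cand_a", ngram_logprob=-4.0, syllable_ratio=1.5, decoder_rank=1),
    Candidate("cand_b", ngram_logprob=-3.0, syllable_ratio=1.0, decoder_rank=2),
    Candidate("cand_c", ngram_logprob=-6.0, syllable_ratio=1.0, decoder_rank=3),
]

for c in rerank(nbest):
    print(c.text, round(score(c), 3))
```

In this toy list the decoder's second choice moves to the top, because its better n-gram score and matching syllable count outweigh its lower decoder rank. A weighted sum keeps each feature's contribution easy to inspect, which is one reason log-linear models are a common choice for re-ranking.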
Keywords/Search Tags: Name transliteration, Name origins, Over-fitting, Dirichlet process mixture, Unsupervised clustering