The Research Of Phrase Extraction Technology For Tibetan And Chinese Statistical Machine Translation

Posted on: 2014-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: X F Dong
GTID: 2268330425470661
Subject: Computer software and theory
Abstract/Summary:
Statistical machine translation includes two kinds of translation model. In the phrase-based translation model, one of the key steps is the extraction of bilingual phrases, and how to extract them accurately and sufficiently has become a focus of study. The Och phrase extraction algorithm depends on large-scale bilingual parallel corpora to balance accuracy and recall, but the available Tibetan-Chinese parallel corpora are limited in size. This leads to serious data sparseness in the training of the translation model, and solving this problem is the focus of this research.

This thesis first introduces the development of statistical machine translation and the system used, which is built with Moses, SRILM, and the word alignment tool GIZA++. GIZA++ is used to obtain the word alignment of the Tibetan-Chinese bilingual parallel corpus, and Moses completes the whole training of the translation model. The final phrase translation probability table is obtained with the improved phrase extraction algorithm.

The phrase extraction algorithm is improved on the basis of the Och algorithm by considering the case in which one Tibetan word is aligned to many Chinese words in the word alignment matrix. Phrase pairs are first extracted with the Och algorithm; for candidate phrases that do not meet its conditions, dictionary information is added: if the dictionary contains the phrase pair, it is extracted, otherwise it is discarded.

Experiments were carried out with both the Och phrase extraction algorithm and the improved one, on different data of the same scale and on corpora of the same kind but different sizes. The experimental results show that the improved algorithm extracts more Tibetan-Chinese bilingual phrase pairs than the Och algorithm. It improves the quality of the translation model by improving the recall of Tibetan-Chinese phrases, which is significant for handling small-scale bilingual parallel corpora.
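The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes that "the conditions" refers to the usual alignment-consistency criterion of Och-style extraction (a phrase pair must contain at least one alignment link, and no link may cross its border), and that the dictionary is a plain set of phrase pairs. The function names, the `max_len` limit, and the toy tokens `s1`..`s3` and `t1`..`t3` are hypothetical placeholders, not taken from the thesis.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]    # (source word index, target word index)
Pair = Tuple[str, str]    # (source phrase, target phrase)


def is_consistent(alignment: Set[Link], s1: int, s2: int, t1: int, t2: int) -> bool:
    """Assumed consistency test: at least one link inside the span pair,
    and no link with exactly one end inside it."""
    has_link = False
    for s, t in alignment:
        inside_src = s1 <= s <= s2
        inside_tgt = t1 <= t <= t2
        if inside_src != inside_tgt:
            return False          # link crosses the phrase-pair border
        if inside_src:
            has_link = True       # link lies fully inside the phrase pair
    return has_link


def extract_phrases(src: List[str], tgt: List[str], alignment: Set[Link],
                    dictionary: Set[Pair], max_len: int = 4):
    """Return (pairs accepted by the consistency test,
               pairs rejected by it but found in the dictionary)."""
    consistent: Set[Pair] = set()
    from_dictionary: Set[Pair] = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            for t1 in range(len(tgt)):
                for t2 in range(t1, min(t1 + max_len, len(tgt))):
                    pair = (" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1]))
                    if is_consistent(alignment, s1, s2, t1, t2):
                        consistent.add(pair)
                    elif pair in dictionary:
                        # Dictionary fallback: keep a rejected pair only if
                        # the (hypothetical) dictionary lists it.
                        from_dictionary.add(pair)
    return consistent, from_dictionary


# Toy example: source word s2 is aligned to two target words (t2 and t3),
# and source word s3 has no alignment link at all.
src = ["s1", "s2", "s3"]
tgt = ["t1", "t2", "t3"]
alignment = {(0, 0), (1, 1), (1, 2)}
dictionary = {("s3", "t3")}

consistent, from_dictionary = extract_phrases(src, tgt, alignment, dictionary)
print(sorted(consistent))
print(sorted(from_dictionary))
```

In this toy case the pair `("s3", "t3")` fails the consistency test because `t3` is linked to `s2`, so it is recovered only through the dictionary lookup. This is one way the recall gain described above could arise on a small corpus; the brute-force span enumeration is chosen for readability, not efficiency.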
Keywords/Search Tags: statistical machine translation, phrase extraction, language model, translation model, Tibetan-Chinese, bilingual phrase pair