
Research On Improving The Performance Of Chinese-Uyghur Word Alignment For Statistical Machine Translation

Posted on: 2020-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Li
Full Text: PDF
GTID: 2438330575496410
Subject: Computer Science and Technology
Abstract/Summary:
Word alignment underlies both the translation models and the reordering models of statistical machine translation, which makes it one of the most important parts of a statistical machine translation system. It also plays an important role elsewhere in natural language processing, for example in bilingual corpus construction, speech recognition, and information retrieval. Errors made in the word alignment phase propagate into the downstream models and compound there. Research on Chinese-English word alignment started early and has achieved good results. Research on Chinese-Uyghur word alignment started late, so it still suffers from scarce corpora and from alignment errors within the corpora that exist. In addition, the rich and complex morphology of Uyghur causes a severe data sparsity problem for Chinese-Uyghur word alignment. During the experiments, we also found that named entities and non-entity words are misaligned with each other in Chinese-Uyghur word alignment. The main research contents of this thesis are as follows:

(1) This thesis applies a perplexity-based training corpus filtering method to the preprocessing of the Chinese-Uyghur bilingual word alignment corpus. Deleting sentence pairs with serious errors yields a cleaner bilingual corpus and thereby improves word alignment performance. The training corpus is filtered by the perplexity of the Chinese-Uyghur aligned sentence pairs, and the effect of different perplexity thresholds on alignment quality is compared. Experiments show that Chinese-Uyghur word alignment performance is effectively improved when the perplexity threshold is less than 12.

(2) This thesis introduces a morphological-segmentation-based algorithm for preprocessing the Chinese-Uyghur word alignment corpus. In addition to the segmentation of Uyghur nouns and verbs, segmentation of Uyghur adjectives is added. The segmented Uyghur sentences carry more semantic information, which alleviates the data sparsity problem to a certain extent and improves Chinese-Uyghur word alignment performance.

(3) This thesis proposes a method based on bilingual named entity recognition to improve Chinese-Uyghur word alignment performance. First, the named entities on both language sides are identified with a CRF method. Next, the identified named entities are replaced with bilingual markers, and the Chinese-Uyghur word alignment experiment is run on the replaced bilingual corpus. Finally, the replaced named entities are restored in the alignment results. This method improves Chinese-Uyghur word alignment performance.
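The perplexity-based filtering in item (1) can be illustrated with a minimal sketch. The abstract does not say which language model is used or how the two sides of a sentence pair are combined, so everything below is an assumption made for illustration: an add-one-smoothed unigram model per language, per-token perplexity, and a rule that keeps a pair only if both sides fall below the threshold. The function names `train_unigram`, `perplexity`, and `filter_pairs` are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model from tokenized sentences.

    Assumption: a unigram model stands in for whatever language model
    the thesis actually uses.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return counts, total, vocab_size


def perplexity(sentence, model):
    """Per-token perplexity of one tokenized sentence under the model."""
    counts, total, vocab_size = model
    if not sentence:
        return float("inf")  # treat empty sentences as maximally suspicious
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return math.exp(-log_prob / len(sentence))


def filter_pairs(pairs, src_model, tgt_model, threshold):
    """Keep sentence pairs whose two sides are both below the threshold.

    Assumption: requiring both sides to pass is one possible rule; the
    source does not state how pair-level perplexity is defined.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if perplexity(src, src_model) < threshold
        and perplexity(tgt, tgt_model) < threshold
    ]
```

In such a setup, each model would be trained on its own side of the corpus, and `filter_pairs` would be run with several thresholds to compare the resulting alignment quality, as the abstract describes. The threshold of 12 reported there applies to the thesis's own models and data, not to this toy model.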
Keywords/Search Tags: Chinese-Uyghur word alignment, Perplexity, Morphological segmentation, GIZA++