Research And Implementation Of Uyghur-Chinese Machine Translation Based On Data Augmentation Technology

Posted on: 2022-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Pan
Full Text: PDF
GTID: 2518306542955479
Subject: Master of Engineering
Abstract/Summary:
In recent years, with the rapid development of human society and increasingly frequent exchanges between countries, the demand for translation between languages has grown, as have the requirements on translation speed. Human translation is time-consuming and expensive, and machine translation has emerged in response. With the rapid development of deep neural network technology, machine translation research in academia and industry has gradually shifted from traditional statistical machine translation to neural machine translation. Training a high-performance neural machine translation model requires a large-scale, high-quality parallel corpus, but such a corpus is difficult to collect for Uyghur and Chinese, which hinders the development of Uyghur-Chinese machine translation. Based on the characteristics of the Uyghur and Chinese languages and on neural machine translation methods, this dissertation completes the following three parts of work:

(1) The shortcomings of traditional word segmentation are analyzed. For Uyghur, a low-resource language, traditional word segmentation performs poorly and produces a large number of rare words. Correct word information cannot then be learned in the subsequent translation task, which leads to poor performance of the trained Uyghur-Chinese translation model. This dissertation therefore adopts a subword segmentation strategy and compares traditional word segmentation, byte pair encoding (BPE) segmentation, and unigram language model segmentation on the Uyghur-Chinese translation task. The experiments show that subword segmentation improves translation performance, and that byte pair encoding segmentation performs best on the Uyghur-Chinese translation task.

(2) A data augmentation method incorporating part-of-speech information is proposed. To address the shortage of parallel data, this dissertation trains a word vector model on a large-scale Chinese corpus, uses the word vector model together with part-of-speech information to generate semantically related words, and combines these with a word alignment model to expand the parallel corpus in both languages. Grammatical error correction is then applied to the newly generated Chinese sentences. In addition, Uyghur and Chinese monolingual corpora (500,000 each) are used in the Uyghur-Chinese translation task through an iterative back-translation strategy. The final model improves on the baseline Transformer by 3.45 BLEU points. The experimental results show that the data augmentation method incorporating part-of-speech information and the iterative back-translation strategy proposed in this dissertation can effectively improve the performance of Uyghur-Chinese machine translation.

(3) Based on the above results, this dissertation implements a Uyghur-Chinese translation system based on the B/S (browser/server) architecture. After accuracy and concurrency tests and a comparison with Google Translate, the system built in this dissertation is shown to meet certain practical application requirements.
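As an illustration of the byte pair encoding segmentation compared in part (1), the following is a minimal sketch of how BPE merge rules can be learned and applied. It is not the dissertation's implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the toy word frequencies are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Start from characters, with a hypothetical end-of-word marker "</w>".
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only.
    toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    rules = learn_bpe(toy_freqs, num_merges=10)
    print(segment("lowest", rules))
```

Because frequent character sequences are merged into single units while rare words fall back to smaller pieces, a word never seen in training can still be represented by known subwords, which is the property that reduces the rare-word problem described in part (1). In practice an established toolkit would normally be used instead of hand-written code.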
Keywords/Search Tags: Uyghur-Chinese translation, neural machine translation, data augmentation, part-of-speech information, iterative back translation