
Research On Key Technologies Of Tibetan-Chinese(Chinese-Tibetan) Machine Translation Based On Deep Learning

Posted on: 2024-03-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J D Z Sang
Full Text: PDF
GTID: 1528307361482664
Subject: Computer Science and Technology
Abstract/Summary:
Machine translation refers to the use of computer technology to convert one natural language into another. Over more than 70 years, machine translation research has undergone several iterations, evolving from rule-based methods through statistical approaches to neural networks. As one of the most active research subjects in natural language processing, machine translation has produced enormous social benefit and economic value. It has consequently become a pivotal technology in contemporary artificial intelligence, with broad prospects for future development. Tibetan natural language processing, an important part of natural language processing for ethnic minorities in China, has made significant progress thanks to the focused efforts of the party and government. In recent years, theories and technologies such as neural networks, pre-training, and large language models have found widespread application in natural language processing, introducing both vast opportunities and notable challenges for Tibetan-related machine translation research. At this stage, however, Tibetan-Chinese machine translation research and application still face challenges including the limited scale of Tibetan-Chinese parallel data, uncertain data quality, unrefined data processing methods and tools, homogeneous data distribution, and a lack of research on pre-trained models. This thesis proposes solutions to these issues through a series of studies focusing on quality evaluation and filtering methods for Tibetan-Chinese parallel data, Tibetan word segmentation methods, data augmentation based on self-learning and back-translation strategies, and a pre-training method for Tibetan-Chinese machine translation via dictionary injection. The main contributions of this thesis are summarized as follows:

Addressing the significant uncertainty in the quality of Tibetan-Chinese parallel data, this thesis proposes an evaluation and filtering method for Tibetan-Chinese parallel corpora based on negative sampling. First, positive and negative data samples are generated through negative sampling. An end-to-end Tibetan-Chinese parallel sentence-pair classifier is then built on a neural network framework to estimate the probability that a given sentence pair is parallel; this estimate serves as the data-quality score. Finally, several experiments assess the quality of Tibetan-Chinese parallel sentence pairs. Filtering the data for the Tibetan-Chinese neural machine translation task yields an increase of 3.7 BLEU, confirming the effectiveness of the method.

Addressing the shortcomings of existing data processing methods and tools for Tibetan-Chinese machine translation, this thesis proposes a neural-network-based Tibetan word segmentation method to mitigate the impact of segmentation quality on Tibetan-Chinese machine translation. First, a neural network framework for Tibetan word segmentation and a customized tagging scheme suited to Tibetan functional suffixes are designed. Then, leveraging both unlabeled and labeled Tibetan text, an end-to-end Tibetan word segmentation model is trained by combining supervised and unsupervised learning. The model achieves a precision of 93.4%, a recall of 94.2%, and an F1 score of 94.1%, validating its effectiveness.

Addressing the restricted distribution and limited domain coverage of Tibetan-Chinese parallel data, this thesis proposes a method for augmenting Tibetan-Chinese machine translation training data based on self-learning and back-translation. Specifically, large-scale monolingual data is introduced into the Tibetan-Chinese neural machine translation framework, and the training data is augmented through alternating iterations of self-learning and back-translation, gradually improving the forward and backward models and thereby the model's generalization across general domains. Experiments show that the self-learning and back-translation models outperform the Transformer baseline by 3.1 and 8.2 BLEU, respectively.

Addressing the dearth of pre-trained model research for Tibetan-Chinese machine translation, this thesis proposes a pre-training method based on dictionary injection. The approach is inspired by the tendency of speakers in cross-lingual interactions to mix vocabulary and phrases from multiple languages to communicate more efficiently. The thesis treats dictionary injection as an effective form of noise addition and pre-trains the Tibetan-Chinese machine translation model with denoising as the learning objective, providing the model with inexpensive and extensive bilingual knowledge associations during the pre-training stage. Comparative experiments against a strong BART baseline show BLEU gains of 2.3 and 2.1 in Tibetan-to-Chinese and Chinese-to-Tibetan translation, respectively.
Keywords/Search Tags:Tibetan-Chinese Machine Translation, Pretraining, Tibetan Word Segmentation, Negative Sampling, Self-learning, Dictionary Injection