
Research On Key Technologies Of Tibetan-Chinese(Chinese-Tibetan) Machine Translation Based On Deep Learning

Posted on: 2024-03-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J D Z Sang
Full Text: PDF
GTID: 1528307361482664
Subject: Computer Science and Technology
Abstract/Summary:
Machine translation refers to the use of computer technology to convert one natural language into another. Over more than 70 years, machine translation research has undergone several iterations, evolving from rule-based methods through statistical approaches to neural networks. As one of the most active research subjects in natural language processing, machine translation has produced enormous social benefit and economic value. It has consequently become a pivotal technology in contemporary artificial intelligence, with broad prospects for future development. Tibetan natural language processing, an important part of natural language processing for ethnic minorities in China, has made significant progress thanks to the focused efforts of the party and government. In recent years, theories and technologies such as neural networks, pre-training, and large language models have found widespread application in natural language processing, introducing both vast opportunities and notable challenges for Tibetan-related machine translation research. At this stage, however, Tibetan-Chinese machine translation research and application still face challenges including the limited scale of Tibetan-Chinese parallel data, uncertain data quality, unrefined data processing methods and tools, homogeneous data distribution, and a lack of research on pre-trained models. This thesis proposes solutions to these issues through a series of studies focusing on quality evaluation and filtering methods for Tibetan-Chinese parallel data, Tibetan word segmentation methods, data augmentation based on self-learning and back-translation strategies, and a pre-training method for Tibetan-Chinese machine translation via dictionary injection. The main contributions of this thesis are summarized as follows:

Addressing the significant uncertainty in the quality of Tibetan-Chinese parallel data, this thesis proposes an evaluation and filtering method for Tibetan-Chinese parallel corpora based on negative sampling. First, positive and negative data samples are generated through negative sampling. An end-to-end Tibetan-Chinese parallel sentence-pair classifier is then built on a neural network framework to estimate the probability that a given sentence pair is parallel; this estimate serves as the data-quality score. Finally, several experiments assess the quality of Tibetan-Chinese parallel sentence pairs. Filtering the data for the Tibetan-Chinese neural machine translation task yields an increase of 3.7 BLEU, confirming the effectiveness of the method.

Addressing the shortcomings of existing data processing methods and tools for Tibetan-Chinese machine translation, this thesis proposes a neural-network-based Tibetan word segmentation method to mitigate the impact of segmentation quality on Tibetan-Chinese machine translation. First, a neural network framework for Tibetan word segmentation and a customized tagging scheme suited to Tibetan functional suffixes are designed. Then, leveraging both unlabeled and labeled Tibetan text, an end-to-end Tibetan word segmentation model is trained by combining supervised and unsupervised learning. The model achieves a precision of 93.4%, a recall of 94.2%, and an F1 score of 94.1%, validating its effectiveness.

Addressing the restricted distribution and limited domain coverage of Tibetan-Chinese parallel data, this thesis proposes a method for augmenting Tibetan-Chinese machine translation training data based on self-learning and back-translation. Specifically, large-scale monolingual data is introduced into the Tibetan-Chinese neural machine translation framework, and the training data is augmented through alternating iterations of self-learning and back-translation, gradually improving the forward and backward models and thereby the model's generalization across general domains. Experiments show that the self-learning and back-translation models outperform the Transformer baseline by 3.1 and 8.2 BLEU, respectively.

Addressing the dearth of pre-trained model research for Tibetan-Chinese machine translation, this thesis proposes a pre-training method based on dictionary injection. The approach is inspired by the tendency of speakers in cross-lingual interactions to mix vocabulary and phrases from multiple languages to communicate more efficiently. The thesis treats dictionary injection as an effective form of noise addition and pre-trains the Tibetan-Chinese machine translation model with denoising as the learning objective, providing the model with inexpensive and extensive bilingual knowledge associations during the pre-training stage. Comparative experiments against a strong BART baseline show BLEU gains of 2.3 and 2.1 in Tibetan-to-Chinese and Chinese-to-Tibetan translation, respectively.
Keywords/Search Tags:Tibetan-Chinese Machine Translation, Pretraining, Tibetan Word Segmentation, Negative Sampling, Self-learning, Dictionary Injection