Research On Tibetan-Chinese Neural Machine Translation Incorporating Prior Knowledge

Posted on: 2022-12-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M X Zhou
GTID: 1488306767460564
Subject: Automation Technology
Abstract/Summary:
With computer technology spreading into every sector, the global economy integrating ever faster, and exchanges between countries and regions growing more frequent, manual translation can no longer meet the rapidly growing demand for translation in terms of either time or cost. Attention has therefore turned to machine translation, which creates a new development opportunity for the field. In recent years, deep learning has driven rapid progress in artificial intelligence, and neural machine translation (NMT) has replaced statistical machine translation as the main research direction in machine translation. NMT models rely on large-scale bilingual training corpora to produce high-quality translations, and studies have shown that their performance drops significantly when resources are insufficient. Because digital resources in Tibetan are scarce, only small-scale Tibetan-Chinese parallel corpora are available for training translation models. This thesis therefore proposes an approach that integrates prior knowledge, which mitigates to a certain extent the problems caused by the lack of Tibetan-Chinese bilingual corpora and improves the quality of Tibetan-Chinese machine translation. It studies how to integrate four different types of prior knowledge. The main research contents and innovations are as follows:

1. Study on Tibetan Sentence Similarity Evaluation Based on Vectorized Representation Techniques: Little research exists on similarity calculation for Tibetan sentences, and the accuracy of existing methods is low. This thesis proposes a Tibetan sentence similarity calculation method that integrates word embeddings. First, two sets of Tibetan word embeddings are obtained by training the skip-gram model and the CBOW model on a 500 M Tibetan monolingual corpus. Tibetan sentence embeddings are then computed from them. Finally, two methods for calculating Tibetan sentence similarity based on surface information are designed and implemented: one combines word embeddings with Euclidean distance, and the other combines word embeddings with Jaccard similarity. Comparative experiments show that the method based on skip-gram word embeddings and Jaccard similarity reaches 85.6% accuracy, better than the other combinations.

2. Research on Domain Adaptation for Tibetan-Chinese Neural Machine Translation Incorporating Out-of-Domain Models: There are currently few effective methods for training Tibetan-Chinese NMT models for different domains. This thesis proposes a Tibetan-Chinese domain adaptation method based on mixed fine-tuning. First, a general Tibetan-Chinese translation model (the parent model) is trained on a general Tibetan-Chinese parallel corpus of 200,000 words. The parent model is then adapted by mixed fine-tuning, using a Tibetan-Chinese government-document parallel corpus of 50,000 words and a further Tibetan-Chinese parallel corpus of 15,000 words, to train translation models for the natural-science and government-document domains. Experiments show that, under low-resource conditions, the method can train in-domain translation models quickly and effectively starting from the out-of-domain model. Moreover, the resulting models outperform the out-of-domain model overall, and the BLEU scores on the test sets of the respective domains improve to 19.03 and 12.15 compared with the general model.

3. Research on Tibetan-Chinese Neural Machine Translation Based on Part-of-Speech Features: To obtain the best performance from a limited corpus by using more external information, this thesis introduces Tibetan part-of-speech features. During training, Tibetan part-of-speech (POS) tags are added on the source side as input features. This is done by generalizing the embedding layer of the encoder in the Transformer encoder-decoder architecture so that it embeds part-of-speech feature information in addition to lexical features. A comparison of two embedding methods, merge and concat, shows that concat improves translation quality more clearly, raising the BLEU score by 3.99.

4. Research on Tibetan-Chinese Neural Machine Translation Combined with Statistical Methods: Word alignments produced by Tibetan-Chinese statistical machine translation are better than, and differ significantly from, the alignment information in the Tibetan-Chinese NMT model. This thesis proposes a Tibetan-Chinese NMT method that incorporates statistical methods. First, statistical machine translation is used to generate bidirectional, symmetrized word alignment information for the Tibetan-Chinese parallel corpus. This word alignment information is then used during Transformer training to supervise the training of the Tibetan-Chinese NMT model, so that the model achieves more accurate translation and alignment. Experimental results show that the BLEU score increases by 1.7 in a low-resource setting.

To sum up, this thesis attempts to solve some existing problems in Tibetan-Chinese machine translation by integrating prior knowledge beyond the bilingual parallel corpora required by conventional NMT: a Tibetan monolingual corpus, an out-of-domain model, Tibetan part-of-speech tagging information, and Tibetan-Chinese word alignment information. Experiments show that integrating prior knowledge can improve the quality of Tibetan-Chinese machine translation to a certain extent. This thesis also lays a foundation for integrating richer prior knowledge into Tibetan-Chinese machine translation in the future and has reference value for related research.
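
As an illustration of the first contribution, the sketch below shows the general ingredients the abstract names: a sentence embedding derived from word embeddings, a Euclidean-distance score, and a Jaccard score. It is not the thesis implementation. The abstract does not say how sentence embeddings are computed or how embeddings are combined with each measure, so the averaging step, the 1 / (1 + distance) mapping, the function names, and the placeholder tokens `w1`, `w2`, `w3` with toy 2-dimensional vectors are all assumptions made for this example.

```python
import math


def sentence_embedding(tokens, vectors):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumption; the abstract does not specify the method.
    Returns None when no token in the sentence has a vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0.

    The 1 / (1 + d) mapping is an illustrative choice, not from the source.
    """
    return 1.0 / (1.0 + math.dist(a, b))


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard index over the sets of tokens in the two sentences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Toy embeddings for placeholder tokens (hypothetical, not real Tibetan data).
    toy_vectors = {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "w3": [1.0, 1.0]}
    sent_a = ["w1", "w2"]
    sent_b = ["w1", "w3"]

    emb_a = sentence_embedding(sent_a, toy_vectors)
    emb_b = sentence_embedding(sent_b, toy_vectors)
    print("Euclidean-based similarity:", euclidean_similarity(emb_a, emb_b))
    print("Jaccard similarity:", jaccard_similarity(sent_a, sent_b))
```

In practice the toy dictionary would be replaced by embeddings trained with skip-gram or CBOW on segmented Tibetan text, and the tokens would come from a Tibetan word segmenter.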
Keywords/Search Tags: Tibetan-Chinese Machine Translation, Prior Knowledge, Neural Machine Translation, Low Resources, Word Embedding, Statistical Methods, POS Tagging