| Military-domain-oriented Turkish-Chinese neural machine translation (NMT) belongs to the category of domain-specific, low-resource language machine translation research. NMT usually needs to keep the size of the source-language vocabulary within 30,000 to 50,000 words, and training an NMT model requires large-scale parallel corpus data. For a low-resource language such as Turkish, this inevitably leads to a serious data sparsity problem: the number of common Turkish words can reach the millions, and a large number of low-frequency words are processed as "unknown words", which harms the fluency of the translations generated by the model. Inspired by the finding that the performance of NMT can be further improved by using source-side linguistic knowledge, this paper addresses key issues such as the processing of "unknown words", the construction of parallel corpus resources, and the integration of source-language syntactic information. Specifically, it uses morphological parsing to construct the Turkish NMT vocabulary, a back-translation-based sentence alignment method to screen open-source Turkish-Chinese bilingual parallel sentences, a sentence clustering method based on similar morphological structure to expand the Turkish-Chinese bilingual parallel data, and an automatic Turkish military term extraction and replacement method to strengthen the military character of the parallel data. All of these aim to improve the ability of the Turkish-Chinese NMT model to process military texts through data processing. The results are of academic significance for advancing the theories, methods and technologies of Turkish-Chinese NMT. The main academic contributions of this study are as follows. Aiming at several natural language processing tasks, such as Turkish NMT vocabulary construction, named entity recognition and domain term extraction, this paper proposed a dictionary- and rule-based Turkish morphological parsing method, by which
we constructed a Turkish morphological parser that supports three different morphological parsing forms: "root + morphosyntactic marker", "root + morphotactic marker" and "root + inflectional group". The parser's morphological parsing dictionary is based on the TS-Corpus morphological parsing vocabulary, supplemented with lists of fixed collocations, named entity affixes, unknown words, spelling errors, compound words and morphological disambiguation rules. The total number of entries is 1,120,000. The morphological disambiguation rules include rules based on word co-occurrence constraints, case affix labeling constraints, and whole-word morphosyntactic labeling constraints. The parser has an open vocabulary optimization function, which can effectively avoid conflicts between rules. In the experiment, the three methods were applied to perform morphological parsing on the training corpus of 1.53 million sentences with a vocabulary size of 742,060. The total Turkish vocabulary size was reduced by 84.36%, 84.78% and 85.33%, respectively. Compared with the basic morphological parsing vocabulary, the morphological parsing method based on "root + inflectional group" could reduce the commonly used vocabulary by 21.4%. Aiming at the shortage of Turkish-Chinese bilingual parallel corpus resources, this paper proposed a morphological-parsing-based Turkish sentence clustering method and designed a sentence-clustering-based program for extracting simple Turkish sentences. The program mainly includes three clustering methods: root structure clustering based on "root + UNK", syntactic structure clustering based on "affix + UNK", and sentence structure clustering with dynamic addition of proper noun, time, date and number markers. Following the three steps of high-frequency structure sentence extraction, online machine translation and semi-supervised translation selection, the 500 most common structure sentences were first extracted from a Turkish monolingual corpus with a size
of 5 million sentences, then the corresponding Chinese translations were obtained using the Bing, NiuTrans and Google online translation systems, and finally a Turkish-Chinese bilingual parallel corpus of about 10,000 sentences was constructed with manual intervention. The results showed that this method can effectively obtain a certain amount of high-quality Turkish-Chinese bilingual parallel data to expand the training corpus. Aiming at the problems of misaligned parallel data and poor translation quality, this paper proposed a back-translation-based method for checking Turkish-Chinese bilingual sentence alignment. The method first uses the Google online machine translation platform to obtain a translation of the sentence to be tested, and then constructs a bag-of-words model to calculate sentence semantic similarity, so as to automatically verify and extract aligned Turkish-Chinese parallel sentences. Using this method, 2.1 million Turkish-Chinese bilingual sentence pairs were screened, and a total of 1.53 million sentences were extracted and retained as the training corpus of the general-domain translation model. The results show that the method can effectively improve the quality of the Turkish-Chinese bilingual parallel corpus. Aiming at the unavoidable domain-specific term problem of military-domain-oriented NMT, this paper proposed a hybrid-strategy-based military term extraction method and designed an automatic Turkish military term extraction program. First, we extracted the distinctive characteristics of Turkish military terms by comparing aviation, communication and military terminology dictionaries; according to these characteristics, we created a stop-word list, a keyword list and a list of morphological parsing sequence patterns; and finally, with the help of point-wise mutual information, information entropy and left-right temporary affixes, we completed construction of the
term auto-extraction program. On this basis, we constructed a Turkish-Chinese military glossary with 1,500 entries, which we also used to optimize the military-domain Turkish-Chinese bilingual pseudo-parallel corpus of 90,000 sentences. Aiming at the shortcoming that an NMT model cannot learn prior knowledge beyond its data, this paper proposed sequence-based and representation-learning-based lexical information fusion methods to encode the Turkish root sequence and the morphosyntactic marker sequence, respectively. The concatenated hidden-layer state representation was taken as the word vector representation for model training. In combination with the BPE-based sub-word segmentation method, we trained 7 general-domain and 2 military-domain Turkish-Chinese NMT models. According to the BLEU evaluation results, the morphological-parsing-based method of creating the Turkish NMT vocabulary was better than the BPE-based sub-word segmentation approach, and the "root + inflectional group" morphological parsing method had the best effect. Specifically, the BLEU score of the trained general-domain translation model was 1.15 higher than that of the BPE baseline model, and the BLEU score of the trained military-domain translation model was 10.08 higher than that of the general translation model. When morphological parsing was carried out according to the "root + morphosyntactic marker" and "root + morphotactic marker" methods, the translation model based on representation learning encoding performed better than the model based on single sequence encoding. In short, this work has investigated the design and application of linguistic knowledge and data augmentation in Turkish-Chinese NMT under low-resource conditions. For three different scenarios, namely NMT vocabulary design, automatic military term extraction and Turkish-Chinese pseudo-parallel data construction, we designed specific technical strategies to improve the NMT
model’s performance and obtained significant improvements. This thesis also provides new methodologies and perspectives for domain-specific, low-resource language machine translation research. In the future, the relevant data and technical achievements can also be extended to other NLP tasks to meet the needs of military preparedness. |
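
The back-translation-based alignment check described above (machine-translate one side of a sentence pair, then compare bag-of-words representations) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function names, the 0.5 threshold, the use of cosine similarity as the similarity measure, and the assumption that sentences arrive already tokenized and already machine-translated are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the bag-of-words count vectors of two token lists."""
    bow_a, bow_b = Counter(tokens_a), Counter(tokens_b)
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(count * bow_b[token] for token, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_aligned_pairs(pairs, threshold=0.5):
    """Keep the pairs whose machine translation is close to the reference side.

    Each item is (pair_id, reference_tokens, machine_translated_tokens), where
    the machine translation is assumed to have been obtained beforehand from
    an online MT system. The threshold is an illustrative value only.
    """
    return [
        pair_id
        for pair_id, reference, translated in pairs
        if cosine_similarity(reference, translated) >= threshold
    ]


# Toy data with placeholder tokens standing in for segmented sentences.
toy_pairs = [
    ("pair-1", ["the", "aircraft", "landed", "safely"], ["the", "aircraft", "landed"]),
    ("pair-2", ["the", "aircraft", "landed", "safely"], ["prices", "rose", "sharply"]),
]
print(filter_aligned_pairs(toy_pairs))
```

A plain bag-of-words comparison ignores word order and synonyms, so in practice the threshold would need tuning on manually checked pairs.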
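
The term extraction step above relies, among other signals, on point-wise mutual information (PMI). As a hedged illustration of that single signal, the sketch below scores adjacent word pairs with $\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$. The function name `bigram_pmi` and the toy corpus are assumptions made for this example. The stop-word list, keyword list, morphological sequence patterns, information entropy and affix checks described in the abstract are not modelled here.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus (illustrative only); real input would be a tokenized Turkish corpus.
toy_corpus = [
    ["hava", "kuvvetleri", "yeni", "uçak", "aldı"],
    ["hava", "kuvvetleri", "tatbikat", "yaptı"],
    ["yeni", "tatbikat", "yaptı"],
]
scores = bigram_pmi(toy_corpus)
# List candidate pairs from highest to lowest PMI.
for pair, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(value, 2))
```

PMI alone tends to favor rare pairs, which is one plausible reason to combine it with frequency filters and the other signals the abstract lists.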