Research On Chinese-Vietnamese Neural Machine Translation Method Based On Comparable Corpus

Posted on: 2021-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: H D Zhu
GTID: 2518306200453434
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid growth of computing power and the introduction of deep learning algorithms, neural machine translation has achieved good results on many language pairs that have large-scale bilingual corpora. For Southeast Asian languages such as Vietnamese, however, Chinese-Vietnamese bilingual corpus resources are scarce, and Chinese and Vietnamese differ greatly as languages. Existing neural machine translation methods, which depend on large corpora, are therefore not fully suited to Chinese-Vietnamese machine translation: a model trained on a small-scale parallel corpus is prone to overfitting, generalizes poorly, and produces unsatisfactory translations. Comparable corpora, by contrast, are relatively abundant and easy to obtain from the Internet. A comparable corpus consists of bilingual texts that describe the same event but are not fully aligned. It contains a great deal of fine-grained bilingual alignment knowledge, such as sentence alignments, phrase alignments, and word pairs. This alignment knowledge can provide strong support for resource-poor Chinese-Vietnamese machine translation. This thesis therefore studies how to extract parallel sentence pairs and phrase pairs from Chinese-Vietnamese comparable corpora, and explores integrating phrase pairs into a Chinese-Vietnamese neural machine translation model to improve translation performance. The main innovations are as follows:

(1) A method for extracting Chinese-Vietnamese parallel sentence pairs using syntactic structure and Tree-LSTM. The parallel sentences contained in a bilingual comparable corpus can effectively alleviate the data sparsity problem of low-resource languages. Because Chinese and Vietnamese differ greatly, syntactic structure provides important support for extracting parallel sentences from Chinese-Vietnamese comparable corpora. A parallel sentence pair extraction method that combines syntactic structure and Tree-LSTM is therefore proposed. First, Chinese-Vietnamese bilingual word vectors are pre-trained so that both languages share the same semantic space. Second, the Chinese and Vietnamese sentences are converted into dependency syntax trees, which retain the structural information and semantic representation of the sentences, and a Tree-LSTM model encodes each syntax tree into a sentence vector. Finally, a fully connected layer is built to train a Chinese-Vietnamese parallel sentence pair classifier. Experimental results show that the accuracy of the proposed method reaches 90.3%.

(2) A method for extracting Chinese-Vietnamese parallel phrase pairs that fuses contextual semantic information. A bilingual comparable corpus contains a large number of bilingual phrase pairs, and the context of a phrase has an important positive influence on the quality of phrase pair extraction. A Chinese-Vietnamese parallel phrase pair extraction method that fuses contextual semantic information is therefore proposed. First, Chinese and Vietnamese word vector matrices are trained on Chinese and Vietnamese monolingual corpora. Then, an encoder is pre-trained to combine sentence encoding information and phrase encoding information through an attention mechanism, generating a phrase vector that contains contextual semantic information. At the same time, parallel phrase pairs are used as constraints: the distance between the vectors of parallel Chinese-Vietnamese phrase pairs in the semantic space is minimized and the distance between non-parallel phrase pairs is maximized, yielding Chinese-Vietnamese bilingual phrase vector representations. Finally, the pre-trained encoder is used to train a parallel phrase pair classifier. Experimental results show that the accuracy of the proposed method reaches 75.62%.

(3) A Chinese-Vietnamese neural machine translation method that fuses phrase pairs. Taking full advantage of the parallel sentence pairs and parallel phrase pairs extracted from comparable corpora, a Chinese-Vietnamese neural machine translation method that fuses phrase pairs is proposed. First, the extracted Chinese-Vietnamese parallel sentence pairs are used to expand the training corpus. Second, the structure of the Transformer translation model is modified by adding two components, a phrase translation table and a scoring module, and by applying tag replacement at the encoder and decoder ends, so that the phrase pairs are integrated into the translation model. Experimental results show that the BLEU score increased by 1.93.
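
The distance constraint described in contribution (2), pulling parallel phrase vectors together and pushing non-parallel ones apart, can be illustrated with a margin-based contrastive loss. The abstract does not state the exact objective, so the sketch below is an assumption for illustration only: the function names, the Euclidean distance, and the `margin` parameter are hypothetical and are not taken from the thesis.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(zh_vec, vi_vec, is_parallel, margin=1.0):
    """Assumed margin-based contrastive loss for one phrase pair.

    Parallel pairs are penalized by their squared distance, so
    minimizing the loss pulls them together. Non-parallel pairs are
    penalized only while they are closer than `margin`, so minimizing
    the loss pushes them at least `margin` apart.
    """
    d = euclidean(zh_vec, vi_vec)
    if is_parallel:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy phrase vectors (made up for illustration, not real embeddings).
zh = [0.0, 0.0]
vi_far = [3.0, 4.0]

loss_parallel = contrastive_loss(zh, vi_far, is_parallel=True)
loss_nonparallel = contrastive_loss(zh, vi_far, is_parallel=False)
```

In this toy case the parallel pair incurs a loss because its vectors are far apart, while the non-parallel pair incurs none because it already lies beyond the margin. In a real system the vectors would come from the pre-trained encoder and the loss would be minimized by gradient descent.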
Keywords/Search Tags: Neural Machine Translation, Chinese-Vietnamese, Comparable Corpus, Translation Knowledge Extraction, Translation Knowledge Integration