
Research On Mongolian-Chinese Cross-Lingual Word Embedding Learning Based On BERT

Posted on: 2022-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Wang
Full Text: PDF
GTID: 2518306779975769
Subject: Automation Technology
Abstract/Summary:
Word embeddings are the foundation of natural language processing tasks. Cross-lingual word embedding maps monolingual word embeddings into a shared low-dimensional space with the help of transfer learning, transferring syntactic, semantic, and structural features between languages so that cross-lingual semantic information can be modeled. It is an important basic step toward cross-lingual information processing for low-resource languages and toward bridging the language gap. At present, however, the quality of cross-lingual word embeddings depends heavily on large-scale parallel corpora or high-quality seed dictionaries, so learning is much less effective for Mongolian-Chinese, where parallel corpora are scarce.

Mongolian is a low-resource language, and large-scale Mongolian-Chinese parallel sentence pairs are difficult to obtain. Moreover, Mongolian word formation is unique and complex: the language is agglutinative with highly variable morphology, which leads to severe data sparseness and out-of-vocabulary problems when neural networks are used to learn Mongolian-Chinese cross-lingual word embeddings. The multilingual BERT model is a dynamic word embedding model pre-trained on large-scale corpora of high-resource languages and contains rich multilingual syntactic and semantic information; fine-tuning it further can alleviate the out-of-vocabulary and data sparsity problems. However, it does not take Mongolian word formation into account, and Mongolian text was not included in its pre-training. In view of these problems, the main contributions of this thesis are as follows.

(1) To move beyond existing static Mongolian word embedding models, a deep-transfer Mongolian dynamic word embedding learning model is proposed. Starting from the multilingual BERT pre-trained model, a small-scale corpus is used for fine-tuning, and the syntactic and semantic features learned from high-resource corpora are mapped, through transfer learning, into low-resource Mongolian dynamic word embedding representations. To verify the effectiveness of the method, synonym comparison experiments were carried out with different models on a data set constructed by our team, the K-means algorithm was used to cluster Mongolian word vectors, and the results were further verified on a keyword mining task. The experiments show that the word embeddings learned by BERT are of higher quality than those of the static Word2Vec model: vectors of semantically related words lie close together in the vector space while unrelated words lie far apart, and the keywords obtained in the keyword mining task are closely related.

(2) To address the lack of large numbers of parallel sentences for constructing Mongolian-Chinese cross-lingual word embeddings, a deep language-knowledge-sharing transfer model is proposed. Using a small set of Mongolian-Chinese parallel sentence pairs, cross-lingual word embedding representation, sub-word learning, and parallel sentence judgment are learned jointly by sharing model parameters and language knowledge. A self-attention mechanism is used to further learn the semantic relationships between the words in Mongolian-Chinese sentence pairs, and the Mongolian-Chinese cross-lingual word embeddings are constructed on this basis. Finally, the bilingual dictionary induction task is used to evaluate the quality of the Mongolian-Chinese word embedding alignment. Experiments show that the proposed method clearly improves over the baseline model.
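As an informal illustration of the pipeline behind (1) and (2), the following sketch (not the thesis code) extracts a dynamic word vector for a single word from a multilingual BERT checkpoint by mean-pooling its subword pieces, and then performs a toy dictionary-induction step by cosine nearest-neighbour retrieval against a handful of candidate vectors. The checkpoint name, the pooling choice, and the retrieval helper are illustrative assumptions; in the thesis the model is first fine-tuned on a small Mongolian corpus, and traditional-script Mongolian may be poorly covered by the raw checkpoint.

```python
# Minimal sketch, not the thesis implementation: contextual ("dynamic") word
# vectors from multilingual BERT plus a toy nearest-neighbour dictionary lookup.
import torch
from transformers import AutoTokenizer, AutoModel

NAME = "bert-base-multilingual-cased"   # assumed checkpoint; the thesis fine-tunes first
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean-pool the last-layer states of the subword pieces that make up `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, hidden_size)
    pieces = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(pieces) + 1):               # locate the word's subword span
        if ids[i:i + len(pieces)] == pieces:
            return hidden[i:i + len(pieces)].mean(dim=0)
    raise ValueError(f"{word!r} not found in the encoded sentence")

def nearest_candidate(query: torch.Tensor, candidates: dict) -> str:
    """Toy dictionary induction: return the candidate word with the most similar vector."""
    return max(candidates,
               key=lambda w: torch.cosine_similarity(query, candidates[w], dim=0).item())
```

A full evaluation would build candidate vectors for an entire Chinese vocabulary and report precision@k against a gold Mongolian-Chinese dictionary; the K-means clustering used in (1) can be run directly on the same pooled word vectors.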
(3) To mitigate the over-fitting that arises when a large model is trained on small data, a data augmentation method that mixes sentences and switches sentence order across the bilingual corpus is proposed. The generated pseudo-data expose words to richer cross-lingual contexts and further improve the semantic alignment of the Mongolian-Chinese cross-lingual word embeddings. Experiments show that this augmentation further improves bilingual dictionary induction performance. Finally, an in-depth analysis of the model's self-attention weights yields several meaningful conclusions that provide useful guidance for future work.
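The augmentation in (3) is described only at a high level, so the sketch below is one plausible reading under explicit assumptions rather than the thesis method: each parallel pair is turned into pseudo training text by concatenating the two sides in both orders (switching sentence order), and the resulting examples are shuffled so that pairs are mixed across the corpus. The pair format and the concatenation scheme are illustrative assumptions.

```python
# Minimal sketch, assuming (mongolian, chinese) string pairs; one plausible reading
# of "mixing and switching sentence order", not the thesis implementation.
import random

def augment(pairs, seed=0):
    """pairs: list of (mn, zh) parallel sentences; returns pseudo bilingual texts."""
    rng = random.Random(seed)
    pseudo = []
    for mn, zh in pairs:
        pseudo.append(f"{mn} {zh}")   # Mongolian-first order
        pseudo.append(f"{zh} {mn}")   # switched sentence order
    rng.shuffle(pseudo)               # mix examples across the corpus
    return pseudo

# Hypothetical usage with placeholder sentences:
# augment([("<mn sentence 1>", "<zh sentence 1>"), ("<mn sentence 2>", "<zh sentence 2>")])
```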
Keywords/Search Tags:Mongolian-Chinese Cross-lingual, BERT, Word embedding, Transfer learning, Data augmentation