
Research On Unsupervised Cross-lingual Mappings Of Word Embeddings

Posted on: 2021-05-05    Degree: Master    Type: Thesis
Country: China    Candidate: S Z Yang    Full Text: PDF
GTID: 2428330611498645    Subject: Computer Science and Technology
Abstract/Summary:
Word embeddings are today's mainstream representation of words. Mapping-based unsupervised cross-lingual embedding aims to map the word embeddings of a source language and a target language, each trained independently on monolingual corpora, into a shared vector space without using any cross-lingual knowledge, so that words with the same meaning in different languages have high similarity. Solving this problem matters both for bridging the digital divide faced by low-resource languages and for cross-lingual natural language processing tasks. In recent years researchers have made considerable progress in this direction. However, existing unsupervised methods have two shortcomings: (1) they do not handle lexical ambiguity, and (2) they lack robustness in practical application scenarios, especially for distant language pairs. In response to these problems, the main work of this thesis covers the following aspects.

First, this thesis analyzes the causes of the lack of robustness and finds that the isomorphism assumption on which unsupervised methods rely does not hold in practical application scenarios. The robustness of unsupervised methods is then analyzed quantitatively, and a distance measure between languages' word-embedding models is proposed. This measure successfully reproduces the distances between natural languages expected from linguistic knowledge: languages from different language families are farther apart than languages within the same family. The experiments further show a linear relationship between language distance and the performance of the unsupervised model, i.e., the more distant the language pair, the more sharply the unsupervised model's performance drops.

Second, this thesis improves the existing model in two respects: constructing the initial solution and strengthening the self-learning process. The enhanced self-learning method achieves results comparable to existing methods without constructing an initial seed dictionary; it can start from a completely random initial solution. On this basis, the thesis further refines the implementation details of model optimization. In the evaluation experiments, the optimized model's performance improves significantly, especially for distant language pairs, and the improved model is markedly more robust.
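To make the idea of a distance measure between monolingual embedding spaces concrete, the sketch below compares the spectra of the two languages' nearest-neighbor similarity graphs, in the spirit of the Laplacian-eigenvalue similarity measures used in prior work on the isomorphism assumption. The abstract does not specify the thesis's exact measure, so every name and parameter here (`laplacian_eigenvalues`, `embedding_space_distance`, `k`, `m`) is an illustrative assumption, not the proposed method.

```python
import numpy as np

def laplacian_eigenvalues(E, k=10):
    # Build a cosine-similarity k-nearest-neighbor graph over the embeddings
    # and return the eigenvalues of its graph Laplacian (ascending order).
    # Dense O(n^2) for clarity; real use would subsample frequent words.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T
    np.fill_diagonal(sims, 0.0)
    adj = np.zeros_like(sims)
    idx = np.argsort(-sims, axis=1)[:, :k]          # k strongest edges per node
    rows = np.repeat(np.arange(E.shape[0]), k)
    adj[rows, idx.ravel()] = sims[rows, idx.ravel()]
    adj = np.maximum(adj, adj.T)                    # symmetrize the graph
    lap = np.diag(adj.sum(axis=1)) - adj            # unnormalized Laplacian
    return np.linalg.eigvalsh(lap)

def embedding_space_distance(E_src, E_tgt, k=10, m=20):
    # Compare the smallest m eigenvalues of the two similarity graphs:
    # the larger the spectral gap, the less isomorphic the two spaces.
    ev_s = laplacian_eigenvalues(E_src, k)[:m]
    ev_t = laplacian_eigenvalues(E_tgt, k)[:m]
    return float(np.sum((ev_s - ev_t) ** 2))
```

Under a measure of this kind, a larger spectral gap indicates less isomorphic spaces, which is consistent with the abstract's finding that performance degrades linearly with language distance.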
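Likewise, here is a minimal sketch of the mapping-plus-self-learning framework the abstract builds on: alternate between solving an orthogonal Procrustes problem for the mapping and re-inducing a bilingual dictionary by nearest-neighbor search, starting from a completely random pairing, as the abstract describes for the enhanced self-learning method. The function names and the plain cosine nearest-neighbor retrieval are assumptions; VecMap-style systems add refinements such as CSLS retrieval, frequency-based initialization, and re-weighting.

```python
import numpy as np

def orthogonal_map(X, Y):
    # Closed-form orthogonal Procrustes solution: W = argmin ||XW - Y||_F
    # subject to W^T W = I, obtained from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_dictionary(XW, Y):
    # Re-induce a seed dictionary: pair every mapped source word with its
    # nearest target word by cosine similarity.
    XWn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nn = (XWn @ Yn.T).argmax(axis=1)
    return np.arange(XW.shape[0]), nn

def self_learning(X, Y, n_iters=10, seed=0):
    # Alternate mapping and dictionary induction, starting (as in the
    # abstract's enhanced self-learning) from a completely random pairing.
    # Assumes Y has at least as many rows as X.
    rng = np.random.default_rng(seed)
    src = np.arange(X.shape[0])
    tgt = rng.permutation(Y.shape[0])[: X.shape[0]]
    for _ in range(n_iters):
        W = orthogonal_map(X[src], Y[tgt])
        src, tgt = induce_dictionary(X @ W, Y)
    return W
```

In practice the loop runs until the induced dictionary stabilizes rather than for a fixed number of iterations.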
Keywords/Search Tags: Word embedding, Unsupervised learning, Cross-lingual learning