
Research On Unsupervised Cross-lingual Mappings Of Word Embeddings

Posted on: 2021-05-05    Degree: Master    Type: Thesis
Country: China    Candidate: S Z Yang    Full Text: PDF
GTID: 2428330611498645    Subject: Computer Science and Technology
Abstract/Summary:
Word embeddings are today's mainstream representation of words. Mapping-based unsupervised cross-lingual embedding aims to map the word embeddings of a source language and a target language, each trained independently on monolingual corpora, into a shared vector space without using any cross-lingual knowledge, so that words with the same meaning in different languages have high similarity. Solving this problem matters both for bridging the digital divide faced by low-resource languages and for cross-lingual natural language processing tasks. In recent years researchers have made considerable progress in this direction. However, existing unsupervised methods have two shortcomings: (1) they do not handle lexical ambiguity, and (2) they lack robustness in practical application scenarios, especially for distant language pairs. In response to these problems, the main work of this thesis covers the following aspects.

First, this thesis analyzes the causes of the lack of robustness and finds that the isomorphism assumption on which unsupervised methods rely does not hold in practical application scenarios. The robustness of unsupervised methods is then analyzed quantitatively, and a distance measure between languages' word-embedding models is proposed. This measure successfully reproduces the distances between natural languages expected from linguistic knowledge: languages from different language families are farther apart than languages within the same family. The experiments further show a linear relationship between language distance and the performance of the unsupervised model, i.e., the more distant the language pair, the more sharply the unsupervised model's performance drops.

Second, this thesis improves the existing model in two respects: constructing the initial solution and strengthening the self-learning process. The enhanced self-learning method achieves results comparable to existing methods without constructing an initial seed dictionary; it can start from a completely random initial solution. On this basis, the thesis further refines the implementation details of model optimization. In the evaluation experiments, the optimized model's performance improves significantly, especially for distant language pairs, and the improved model is markedly more robust.
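To make the idea of a distance measure between monolingual embedding spaces concrete, the sketch below compares the spectra of the two languages' nearest-neighbor similarity graphs, in the spirit of the Laplacian-eigenvalue similarity measures used in prior work on the isomorphism assumption. The abstract does not specify the thesis's exact measure, so every name and parameter here (`laplacian_eigenvalues`, `embedding_space_distance`, `k`, `m`) is an illustrative assumption, not the proposed method.

```python
import numpy as np

def laplacian_eigenvalues(E, k=10):
    # Build a cosine-similarity k-nearest-neighbor graph over the embeddings
    # and return the eigenvalues of its graph Laplacian (ascending order).
    # Dense O(n^2) for clarity; real use would subsample frequent words.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T
    np.fill_diagonal(sims, 0.0)
    adj = np.zeros_like(sims)
    idx = np.argsort(-sims, axis=1)[:, :k]          # k strongest edges per node
    rows = np.repeat(np.arange(E.shape[0]), k)
    adj[rows, idx.ravel()] = sims[rows, idx.ravel()]
    adj = np.maximum(adj, adj.T)                    # symmetrize the graph
    lap = np.diag(adj.sum(axis=1)) - adj            # unnormalized Laplacian
    return np.linalg.eigvalsh(lap)

def embedding_space_distance(E_src, E_tgt, k=10, m=20):
    # Compare the smallest m eigenvalues of the two similarity graphs:
    # the larger the spectral gap, the less isomorphic the two spaces.
    ev_s = laplacian_eigenvalues(E_src, k)[:m]
    ev_t = laplacian_eigenvalues(E_tgt, k)[:m]
    return float(np.sum((ev_s - ev_t) ** 2))
```

Under a measure of this kind, a larger spectral gap indicates less isomorphic spaces, which is consistent with the abstract's finding that performance degrades linearly with language distance.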
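Likewise, here is a minimal sketch of the mapping-plus-self-learning framework the abstract builds on: alternate between solving an orthogonal Procrustes problem for the mapping and re-inducing a bilingual dictionary by nearest-neighbor search, starting from a completely random pairing, as the abstract describes for the enhanced self-learning method. The function names and the plain cosine nearest-neighbor retrieval are assumptions; VecMap-style systems add refinements such as CSLS retrieval, frequency-based initialization, and re-weighting.

```python
import numpy as np

def orthogonal_map(X, Y):
    # Closed-form orthogonal Procrustes solution: W = argmin ||XW - Y||_F
    # subject to W^T W = I, obtained from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_dictionary(XW, Y):
    # Re-induce a seed dictionary: pair every mapped source word with its
    # nearest target word by cosine similarity.
    XWn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nn = (XWn @ Yn.T).argmax(axis=1)
    return np.arange(XW.shape[0]), nn

def self_learning(X, Y, n_iters=10, seed=0):
    # Alternate mapping and dictionary induction, starting (as in the
    # abstract's enhanced self-learning) from a completely random pairing.
    # Assumes Y has at least as many rows as X.
    rng = np.random.default_rng(seed)
    src = np.arange(X.shape[0])
    tgt = rng.permutation(Y.shape[0])[: X.shape[0]]
    for _ in range(n_iters):
        W = orthogonal_map(X[src], Y[tgt])
        src, tgt = induce_dictionary(X @ W, Y)
    return W
```

In practice the loop runs until the induced dictionary stabilizes rather than for a fixed number of iterations.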
Keywords/Search Tags: Word embedding, Unsupervised learning, Cross-lingual learning