
Bilingual Word Representation Learning From Non-parallel Corpora

Posted on: 2019-04-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Zhang
Full Text: PDF
GTID: 1368330590951477
Subject: Computer Science and Technology
Abstract/Summary:
With the increase of international exchanges, the demand for smooth cross-lingual communication keeps growing. Against this background, cross-lingual natural language processing plays an increasingly important role. Word representation is the basis of almost all natural language processing tasks, and in the cross-lingual setting, bilingual word representation learning has likewise attracted extensive attention from researchers. Parallel corpora are the ideal resource for this task, but they are scarce for many languages and specialized domains, whereas non-parallel corpora are far more abundant. Learning bilingual word representations from non-parallel corpora can therefore be applied in many more scenarios. However, cross-lingual signals in non-parallel corpora are harder to capture, which poses greater challenges for research. Most existing related work still relies on bilingual supervision signals for learning. Using the strength of the available bilingual supervision as the organizing thread, this dissertation discusses the challenges in related work and presents corresponding research for each challenge. The main contents are as follows.

1. Supervised scenarios. Existing research on bilingual word representation learning is mostly conducted with abundant bilingual supervision, yet challenges remain. For one thing, the common practice of building a bilingual lexicon from bilingual word representations by nearest-neighbor search has inherent limitations. For another, existing work does not account for the phenomenon of multiple alternative word translations, which is prevalent across natural languages. This dissertation proposes using the earth mover's distance for word translation, and finds that it not only addresses the limitations of the nearest-neighbor approach but also handles multiple alternative word translations automatically. Furthermore, its effectiveness can be further boosted by carrying the idea from the word translation phase back into the training phase of the bilingual word representations.

2. Weakly-supervised scenarios. For many low-resource languages and specialized domains, bilingual supervision signals are scarce and difficult to obtain. To address this challenge of limited supervision, this dissertation proposes a latent-variable model for matching bilingual word representations. It makes full use of the limited bilingual supervision signals, so that good performance can be achieved even in weakly-supervised scenarios.

3. Unsupervised scenarios. Pushing the challenge of limited bilingual supervision further, this dissertation explores bilingual word representation learning in fully unsupervised scenarios. First, adversarial training is explored for this task. Then a more general framework of distribution distance minimization is proposed, with the earth mover's distance as the choice of distribution distance. Experimental results demonstrate that learning bilingual word representations is feasible even under the harsh condition of zero bilingual supervision.
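The earth mover's distance that recurs in points 1 and 3 can be illustrated with a small optimal-transport sketch. Entropy-regularized Sinkhorn iterations are a common way to approximate it; the embeddings below are toy values invented for illustration, not data or code from the dissertation. The transport plan couples a source-word distribution to a target-word distribution, and reading off each row's largest entry yields a candidate translation for each source word:

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.05, n_iters=500):
    """Entropy-regularized optimal transport: an efficient approximation
    to the earth mover's distance between two discrete distributions.
    Returns the transport plan P and the transport cost <P, cost>."""
    K = np.exp(-cost / reg)       # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):      # alternately rescale to match marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return P, float((P * cost).sum())

# Toy "bilingual" embeddings (hypothetical values): each source word
# should align with the target word that shares its vector.
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])  # a permutation of src

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

cost = 1.0 - unit(src) @ unit(tgt).T   # pairwise cosine distance
a = np.full(3, 1/3)                    # uniform source word weights
b = np.full(3, 1/3)                    # uniform target word weights

P, emd = sinkhorn(cost, a, b)
translation = P.argmax(axis=1)         # one-to-one reading of the plan
```

Unlike independent nearest-neighbor lookups, the plan P is a global coupling constrained to the marginals a and b, so mass spread across a row can also reflect several plausible alternative translations rather than forcing a single hard choice.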
Keywords/Search Tags: Bilingual Word Representation Learning, Bilingual Word Embeddings, Bilingual Lexicon Induction, Non-Parallel Corpora, Cross-Lingual Natural Language Processing