
Unsupervised Cross-lingual Word Representation Learning Method Based On Co-training

Posted on: 2022-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Su
GTID: 2518306572959899
Subject: Computer technology

Abstract/Summary:
Cross-lingual word embedding places the embeddings of words from different languages in the same vector space, so that the similarity between words from different languages can be measured easily. Unsupervised cross-lingual word representation learning aims to learn cross-lingual word embeddings without any external cross-lingual information. Although existing unsupervised cross-lingual word representation learning methods have achieved certain results, many shortcomings remain. One of them is that the bilingual translation dictionary acquisition method in the self-learning step is too simple, so it cannot provide high-confidence bilingual information for the subsequent iterative steps; this weakens the self-learning process and ultimately degrades the quality of the resulting cross-lingual word embeddings.

To solve this problem, this paper proposes an unsupervised cross-lingual word representation learning method based on co-training, so as to improve the quality of cross-lingual word representations. The idea is to compare the bilingual translation dictionaries used in the self-learning steps of the different training sub-processes of the co-training procedure and to select the more credible bilingual translation pairs for the subsequent training steps of each sub-process, thereby improving the quality of the information used during training and ultimately improving the model's performance. Specifically, this paper designs an unsupervised cross-lingual word representation co-training method based on different word embedding models and another co-training method based on different corpus sources, and both of them outperform the baseline model.

This paper also explores principal component analysis based on a linear autoencoder, and realizes a principal component acquisition method based on the linear autoencoder for the pointwise mutual information matrix obtained from the monolingual corpus. On this basis, a cross-lingual word representation co-training method based on the linear autoencoder is designed, which improves the effect of cross-lingual word embedding learning and further verifies the feasibility of the co-training approach for unsupervised cross-lingual word representation learning.
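The following Python sketch is not code from the thesis; the helper names, array shapes, and the nearest-neighbour dictionary induction are illustrative assumptions. It shows the kind of agreement-based filtering described above: each training sub-process induces a bilingual dictionary from its own mapped embedding space, and only the translation pairs on which both sub-processes agree are kept as high-confidence input for the next self-learning iteration.

```python
# Minimal sketch of agreement-based dictionary filtering for co-training.
# Two sub-processes (e.g. different embedding models or corpus sources) each
# induce a bilingual dictionary; only pairs they agree on are kept.
import numpy as np


def induce_dictionary(src_emb: np.ndarray, tgt_emb: np.ndarray) -> dict[int, int]:
    """Map each source word index to its nearest target word index by cosine similarity."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) cosine similarities
    return {i: int(j) for i, j in enumerate(sims.argmax(axis=1))}


def agreed_pairs(dict_a: dict[int, int], dict_b: dict[int, int]) -> dict[int, int]:
    """Keep only the translation pairs that both sub-processes induce identically."""
    return {s: t for s, t in dict_a.items() if dict_b.get(s) == t}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy embeddings standing in for the two sub-processes' mapped spaces.
    src_a, tgt_a = rng.normal(size=(100, 50)), rng.normal(size=(120, 50))
    src_b = src_a + 0.05 * rng.normal(size=(100, 50))
    tgt_b = tgt_a + 0.05 * rng.normal(size=(120, 50))

    high_conf = agreed_pairs(induce_dictionary(src_a, tgt_a),
                             induce_dictionary(src_b, tgt_b))
    print(f"{len(high_conf)} high-confidence pairs kept for the next self-learning step")
```

In the abstract's two variants, the sub-processes differ either in the word embedding model or in the corpus source; the agreement-based filtering step itself is the same in both.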
Keywords/Search Tags: cross-lingual word representation, co-training, unsupervised learning