
Cross-Lingual Text Classification Based On Monolingual Word Embedding Mapping Without Parallel Corpus

Posted on: 2020-11-06  Degree: Master  Type: Thesis
Country: China  Candidate: N Wang  Full Text: PDF
GTID: 2428330575989342  Subject: Computer software and theory
Abstract/Summary:
Text classification is now a common task in natural language processing, but a classifier trained on one language cannot be applied directly to another, because word representations are language-specific. Training a separate classification model for each language costs considerable time and effort; moreover, text classification is a supervised learning task that requires large numbers of labeled training samples, so a monolingual classifier may fail on low-resource languages. In addition, mainstream cross-lingual word embedding models depend on costly parallel corpora, which limits transfer between languages. To address these problems, this thesis conducts in-depth research on text classification and cross-lingual word embeddings, and proposes a monolingual neural classification model with an attention mechanism and two cross-lingual text classification methods that require no parallel corpus, as follows:

(1) For monolingual text classification, this thesis proposes a bi-directional GRU neural network model and introduces an attention mechanism into it. Compared with traditional machine learning methods, the bi-directional GRU classifier with attention achieves improvements of varying degrees, so this model is also used as the classifier in the cross-lingual experiments (a sketch of the architecture follows the abstract).

(2) For cross-lingual text classification via word embedding mapping with non-parallel corpora, this thesis constructs a bilingual word embedding space from two monolingual embeddings. Building on current research on adversarial learning, Procrustes analysis and Cross-domain Similarity Local Scaling (CSLS) are introduced to fine-tune the mapping obtained by adversarial learning, so that translation pairs lie as close as possible in the bilingual embedding space. Secondly, Procrustes analysis and CSLS are also applied in a self-learning training procedure that iteratively adjusts the mapping matrix until convergence (sketches of both steps follow the abstract). Comparative experiments against BilBOWA, Google Translate, and a variant without CSLS show that the orthogonal constraint and CSLS each improve classification performance, and that using both together gives the best results; the self-learning method with CSLS and the orthogonal constraint nearly catches up with methods that use parallel corpora.
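The monolingual classifier in (1) combines a bi-directional GRU with additive attention over the hidden states. Below is a minimal sketch of such an architecture, assuming PyTorch; the vocabulary size, dimensions, class count, and all names are illustrative, not values taken from the thesis.

    # Minimal sketch of a bi-directional GRU classifier with additive attention
    # (assumes PyTorch; all hyperparameters below are illustrative).
    import torch
    import torch.nn as nn

    class BiGRUAttentionClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64, num_classes=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            # Additive attention: score each time step, then take a weighted sum.
            self.attn = nn.Linear(2 * hidden_dim, 1)
            self.fc = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, token_ids):                      # (batch, seq_len)
            h, _ = self.gru(self.embedding(token_ids))     # (batch, seq_len, 2*hidden)
            weights = torch.softmax(self.attn(h), dim=1)   # attention over time steps
            context = (weights * h).sum(dim=1)             # (batch, 2*hidden)
            return self.fc(context)                        # class logits

    # Usage example on random token ids:
    logits = BiGRUAttentionClassifier()(torch.randint(0, 10000, (4, 20)))

The attention weights let the classifier focus on the most discriminative words in a sentence instead of relying only on the final GRU state.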
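The Procrustes fine-tuning step in (2) solves a standard problem: given matrices X and Y of paired source and target word vectors, find the orthogonal matrix W minimizing ||XW - Y||_F, whose closed-form solution is W = UV^T with U S V^T the SVD of X^T Y. A minimal NumPy sketch follows; in the thesis the pairs come from a dictionary induced by adversarial learning rather than a parallel corpus, and the matrix names here are illustrative.

    # Minimal sketch of the orthogonal Procrustes step (assumes NumPy).
    # X, Y: (n, d) arrays of paired source/target word vectors.
    import numpy as np

    def procrustes(X, Y):
        # W = U V^T, where U S V^T is the SVD of X^T Y, minimizes ||XW - Y||_F
        # over orthogonal W.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt   # orthogonal mapping matrix, shape (d, d)

The orthogonality of W is exactly the "orthogonal constraint" the experiments ablate: it preserves distances within each monolingual space while aligning the two spaces.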
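CSLS, used both to fine-tune the mapping and inside the self-learning loop, rescales plain cosine similarity by the density of each word's cross-domain neighborhood, penalizing "hub" words that are near everything. Below is a minimal sketch assuming row-normalized embedding matrices; the neighborhood size k=10 follows common practice and is not a value stated in the abstract.

    # Minimal sketch of Cross-domain Similarity Local Scaling (CSLS).
    # src: (n_s, d), tgt: (n_t, d), both row-normalized (assumes NumPy).
    import numpy as np

    def csls_scores(src, tgt, k=10):
        sims = src @ tgt.T                                  # cosine similarities
        # Mean similarity of each word to its k nearest cross-domain neighbors.
        r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # per source word
        r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # per target word
        # CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y)
        return 2 * sims - r_src[:, None] - r_tgt[None, :]

In the self-learning procedure, the highest-scoring CSLS pairs form the dictionary for the next Procrustes step, and the loop repeats until the mapping matrix converges.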
Keywords/Search Tags:Text Classification, Cross-Lingual Word Embedding, Attention Mechanism, Cross-domain Similarity Local Scaling, Procrustes Analysis