
Research On The Application Of Machine Translation In Cross-lingual Document Classification

Posted on: 2019-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Q Z Liu
GTID: 2428330566498098
Subject: Computer Science and Technology
Abstract/Summary:
Cross-lingual document classification (CLDC) is a document classification task in which a model is trained on a source language and tested on a target language. For document classification in a specific language, supervised methods usually require expensive human-labeled training corpora, which are particularly hard to obtain for low-resource languages. CLDC attempts to exploit labeled training data in a source language to solve document classification problems in a target language. As a cross-lingual classification task, CLDC is of academic importance for the study of cross-lingual transfer learning, and since most of the world's languages are low-resource, it is also of practical value in industry. Machine translation (MT) is one of the most intuitive approaches to mapping data and knowledge from one language into another. However, previous work reports that the performance of the MT-based approach in CLDC is unsatisfactory. We examine the original MT-based approach closely, and experiments support our intuition that the sparsity of word features is the bottleneck. Further experiments show that the performance of the original MT-based approach can be improved significantly by using feature grouping to alleviate the sparsity problem. Based on these findings, we propose a new architecture for CLDC that combines any off-the-shelf machine translation model with monolingual word embeddings. Experiments show that our MT + monolingual embedding approach matches or outperforms state-of-the-art models under different scenarios, whether or not bilingual parallel corpora are available. Further analysis reveals that our approach is robust with respect to the performance of the machine translation system and the word embedding model used. In addition, examples show that, with the help of the machine translation system, our approach is more sensitive to some word semantics in specific contexts than approaches that use bilingual word embeddings directly, which helps the classification task in some cases.
Keywords/Search Tags: Natural Language Processing, Cross-lingual Document Classification, Machine Translation, Word Embedding
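The general idea in the abstract (translate target-language documents into the source language, then classify them using dense monolingual word embeddings rather than sparse word features) can be illustrated with a small Python sketch. This is not the thesis architecture: the word-level lookup table `TOY_MT` is a hypothetical stand-in for an off-the-shelf MT system, the 2-D vectors in `EMB` are invented toy embeddings, and the nearest-centroid classifier is a simple substitute chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict

# Hypothetical stand-in for an off-the-shelf MT system (Spanish -> English).
# A real system would translate whole sentences, not look up single words.
TOY_MT = {"fútbol": "football", "partido": "match",
          "banco": "bank", "acciones": "stocks", "el": "the"}

# Invented 2-D monolingual (English) word embeddings, for illustration only.
EMB = {
    "football": (1.0, 0.0), "match": (0.9, 0.1), "goal": (0.8, 0.0),
    "bank": (0.0, 1.0), "stocks": (0.1, 0.9), "market": (0.0, 0.8),
}


def translate(doc):
    """Map each target-language token into the source language (toy MT)."""
    return [TOY_MT.get(tok, tok) for tok in doc.lower().split()]


def embed(tokens):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))


def train_centroids(labeled_docs):
    """Compute one mean document vector per label from source-language data."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for text, label in labeled_docs:
        vec = embed(text.lower().split())
        sums[label][0] += vec[0]
        sums[label][1] += vec[1]
        counts[label] += 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab])
            for lab, s in sums.items()}


def classify(tokens, centroids):
    """Return the label whose centroid is closest to the document vector."""
    vec = embed(tokens)
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Train on labeled source-language (English) documents only.
CENTROIDS = train_centroids([
    ("football match goal", "sports"),
    ("bank stocks market", "finance"),
])

# Classify a target-language (Spanish) document after translating it.
print(classify(translate("el partido de fútbol"), CENTROIDS))
```

Because documents are represented as averaged dense vectors, a translated document can still be placed near the right class even when it shares only some words with the training documents, which is the kind of sparsity relief the abstract attributes to grouping word features.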