
Research On Chinese-Korean Cross-Lingual Text Classification Method Based On Bilingual Topical Word Embedding Model

Posted on: 2020-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: M J Tian
Full Text: PDF
GTID: 2428330572489353
Subject: Computer application technology

Abstract/Summary:
Cross-lingual text classification is a vital technology for leveraging multilingual information resources effectively. It lowers the difficulty that language differences pose for information retrieval and text classification, facilitates the exchange of knowledge, and promotes economic and social development. As the most widely used approach to cross-lingual text classification, the bilingual word embedding model captures contextual and cross-lingual semantics and embeds them into the vector representations of bilingual words. However, in bilingual word embeddings a word with multiple meanings is represented by a single vector; this representation causes ambiguity, which in turn degrades the accuracy of cross-lingual text classification. To address this problem, this dissertation proposes a bilingual topical word embedding model that resolves the ambiguity caused by polysemy and improves classification accuracy with a deep learning algorithm.

First, a Chinese-Korean sentence-aligned parallel corpus of 360,000 sentence pairs was collected for training the bilingual word embeddings, and word alignment relations were extracted from the sentence pairs. In addition, more than 4,000 parallel documents were collected for cross-lingual text classification. Second, the bilingual topical word embedding model was proposed, which combines bilingual word embeddings with a topic model that has an adaptive multi-prototype property. Representations of bilingual words were obtained by modeling the collected parallel corpus with the proposed model: bilingual words are represented in the same vector space, and the different meanings of a word are described by different latent topic concepts. Finally, the bilingual word representations obtained by the proposed model were fed into a deep learning text classifier for cross-lingual text classification, which was trained on text in one language and tested on text in the other language.

Extracting and visualizing the bilingual word embeddings produced by the proposed bilingual topical word embedding model shows that the model learns an embedding for each meaning of a polysemous word. The experimental results show that the bilingual topical word embedding model combined with the deep learning algorithm achieves the highest cross-lingual text classification accuracy, up to 91.76%, outperforming other classical methods.
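The pipeline summarized above can be illustrated with a minimal sketch, not the dissertation's implementation: topic-specific bilingual embeddings are assumed to be already trained in a shared space, document vectors are obtained by averaging them, and a classifier is fitted on documents in one language and evaluated on documents in the other. The toy vocabulary, the hash-based topic assignment, and the linear classifier standing in for the deep learning classifier are all illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    DIM, N_TOPICS = 50, 4

    # Hypothetical shared space: each (word, topic) pair has its own vector, so a
    # polysemous word keeps one embedding per latent topic (multi-prototype).
    # In the dissertation these vectors come from the bilingual topical word
    # embedding model trained on the parallel corpus; here they are random placeholders.
    vocab = ["경제", "시장", "경기", "银行", "市场", "比赛"]
    embeddings = {(w, k): rng.normal(size=DIM) for w in vocab for k in range(N_TOPICS)}

    def assign_topic(word):
        # Stand-in for the topic model's inference step.
        return hash(word) % N_TOPICS

    def doc_vector(tokens):
        # Average the topic-specific embeddings of the tokens in a document.
        vecs = [embeddings[(t, assign_topic(t))] for t in tokens
                if (t, assign_topic(t)) in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    # Toy labelled documents: train on one language, test on the other.
    zh_docs, zh_labels = [["银行", "市场"], ["比赛"]], [0, 1]
    ko_docs, ko_labels = [["경제", "시장"], ["경기"]], [0, 1]

    X_train = np.stack([doc_vector(d) for d in zh_docs])
    X_test = np.stack([doc_vector(d) for d in ko_docs])

    clf = LogisticRegression().fit(X_train, zh_labels)
    print("cross-lingual accuracy:", clf.score(X_test, ko_labels))

With real embeddings learned from the sentence-aligned corpus, translation pairs such as 市场 and 시장 ("market") would lie close together in the shared space, which is what makes training in one language and testing in the other feasible; the random vectors above only illustrate the data flow.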
Keywords/Search Tags:cross-lingual text classification, bilingual word embeddings, bilingual topic model, multi-prototype representations, polysemy, deep learning algorithm