
Research On Chinese-Korean Cross-Lingual Text Classification Method Based On Bilingual Topical Word Embedding Model

Posted on: 2020-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: M J Tian
Full Text: PDF
GTID: 2428330572489353
Subject: Computer application technology

Abstract/Summary:
Cross-lingual text classification is a vital technology for leveraging multilingual information resources effectively. It lowers the difficulty that language differences pose for information retrieval and text classification, facilitates the exchange of knowledge, and promotes economic and social development. As the most widely used approach to cross-lingual text classification, the bilingual word embedding model captures contextual and cross-lingual semantics and embeds them into the vector representations of bilingual words. However, in bilingual word embeddings a word with multiple meanings is represented by a single vector; this representation causes ambiguity, which in turn degrades the accuracy of cross-lingual text classification. To address this problem, this dissertation proposes a bilingual topical word embedding model that resolves the ambiguity caused by polysemy and improves classification accuracy with a deep learning algorithm.

First, a Chinese-Korean sentence-aligned parallel corpus of 360,000 sentence pairs was collected for training the bilingual word embeddings, and word alignment relations were extracted from the sentence pairs. In addition, more than 4,000 parallel documents were collected for cross-lingual text classification. Second, the bilingual topical word embedding model was proposed, which combines bilingual word embeddings with a topic model that has an adaptive multi-prototype property. Representations of bilingual words were obtained by modeling the collected parallel corpus with the proposed model: bilingual words are represented in the same vector space, and the different meanings of a word are described by different latent topic concepts. Finally, the bilingual word representations obtained by the proposed model were fed into a deep learning text classifier for cross-lingual text classification, which was trained on text in one language and tested on text in the other language.

Extracting and visualizing the bilingual word embeddings produced by the proposed bilingual topical word embedding model shows that the model learns an embedding for each meaning of a polysemous word. The experimental results show that the bilingual topical word embedding model combined with the deep learning algorithm achieves the highest cross-lingual text classification accuracy, up to 91.76%, outperforming other classical methods.
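The pipeline summarized above can be illustrated with a minimal sketch, not the dissertation's implementation: topic-specific bilingual embeddings are assumed to be already trained in a shared space, document vectors are obtained by averaging them, and a classifier is fitted on documents in one language and evaluated on documents in the other. The toy vocabulary, the hash-based topic assignment, and the linear classifier standing in for the deep learning classifier are all illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    DIM, N_TOPICS = 50, 4

    # Hypothetical shared space: each (word, topic) pair has its own vector, so a
    # polysemous word keeps one embedding per latent topic (multi-prototype).
    # In the dissertation these vectors come from the bilingual topical word
    # embedding model trained on the parallel corpus; here they are random placeholders.
    vocab = ["경제", "시장", "경기", "银行", "市场", "比赛"]
    embeddings = {(w, k): rng.normal(size=DIM) for w in vocab for k in range(N_TOPICS)}

    def assign_topic(word):
        # Stand-in for the topic model's inference step.
        return hash(word) % N_TOPICS

    def doc_vector(tokens):
        # Average the topic-specific embeddings of the tokens in a document.
        vecs = [embeddings[(t, assign_topic(t))] for t in tokens
                if (t, assign_topic(t)) in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    # Toy labelled documents: train on one language, test on the other.
    zh_docs, zh_labels = [["银行", "市场"], ["比赛"]], [0, 1]
    ko_docs, ko_labels = [["경제", "시장"], ["경기"]], [0, 1]

    X_train = np.stack([doc_vector(d) for d in zh_docs])
    X_test = np.stack([doc_vector(d) for d in ko_docs])

    clf = LogisticRegression().fit(X_train, zh_labels)
    print("cross-lingual accuracy:", clf.score(X_test, ko_labels))

With real embeddings learned from the sentence-aligned corpus, translation pairs such as 市场 and 시장 ("market") would lie close together in the shared space, which is what makes training in one language and testing in the other feasible; the random vectors above only illustrate the data flow.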
Keywords/Search Tags:cross-lingual text classification, bilingual word embeddings, bilingual topic model, multi-prototype representations, polysemy, deep learning algorithm