Study On Cross Language Text Categorization

Posted on: 2012-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2218330362453615
Subject: Computer application technology
Abstract/Summary:
Cross Language Text Categorization (CLTC) is the task of assigning class labels to documents written in a target language (e.g. Chinese) when the system is trained on labeled examples in a source language (e.g. English). In this thesis, we study two key problems of CLTC.

The first problem is the language barrier between the source and target languages. To solve this problem, we propose the Cross Language K-Nearest Neighbors (CLKNN) algorithm, which approaches CLTC from the perspective of Information Retrieval. The only external resource required by CLKNN is a bilingual dictionary. Experimental results show that our method gives promising performance and outperforms the translation-based method.

The second problem is the topic drift between languages, which causes a classifier trained on the source language to perform poorly on the target language. To solve this problem, we propose an active learning algorithm for CLTC. Our algorithm makes use of both labeled data in the source language and unlabeled data in the target language. The classifier first learns classification knowledge from the source language and then learns culture-dependent knowledge from the target language. In addition, we extend our algorithm to a two-view form by treating the source and target languages as two views of the classification problem. Experiments show that our algorithm effectively improves cross-language classification performance.
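
As a rough illustration of the first idea only, the sketch below shows one way a bilingual dictionary can let target-language documents be matched against labeled source-language documents with a nearest-neighbor vote. It is not the CLKNN algorithm from the thesis, whose details the abstract does not give. The toy dictionary, the cosine-similarity ranking, the similarity-weighted vote, and all names (TOY_DICT, translate_terms, knn_classify) are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy bilingual dictionary: target-language (Chinese) term
# -> candidate source-language (English) terms. Illustration only.
TOY_DICT = {
    "足球": ["football", "soccer"],
    "比赛": ["match", "game"],
    "股票": ["stock", "share"],
    "市场": ["market"],
}


def translate_terms(tokens, dictionary):
    """Turn target-language tokens into a bag of source-language terms.

    Every dictionary candidate is kept; tokens missing from the
    dictionary are dropped.
    """
    bag = Counter()
    for token in tokens:
        for source_term in dictionary.get(token, []):
            bag[source_term] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two term-count bags."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(target_tokens, labeled_source_docs, dictionary, k=3):
    """Label a target-language document by a similarity-weighted vote
    among its k most similar labeled source-language documents.

    Returns None when no source document shares any term with the query.
    """
    query = translate_terms(target_tokens, dictionary)
    scored = [
        (cosine(query, Counter(tokens)), label)
        for tokens, label in labeled_source_docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] == 0.0:
        return None
    votes = Counter()
    for score, label in scored[:k]:
        votes[label] += score
    return votes.most_common(1)[0][0]


# Toy labeled source-language (English) training documents.
SOURCE_DOCS = [
    (["football", "match", "goal"], "sports"),
    (["soccer", "game", "team"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["share", "market", "bank"], "finance"),
]

sports_label = knn_classify(["足球", "比赛"], SOURCE_DOCS, TOY_DICT)
finance_label = knn_classify(["股票", "市场"], SOURCE_DOCS, TOY_DICT)
unknown_label = knn_classify(["天气"], SOURCE_DOCS, TOY_DICT)
```

In this sketch the target document is treated as a query against the source-language collection, which mirrors the information-retrieval framing in spirit; a realistic system would also need word segmentation, term weighting, and handling of ambiguous dictionary entries.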
Keywords/Search Tags: text categorization, cross language text categorization, information retrieval, active learning