Study On Cross Language Text Categorization

Posted on:2012-11-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

GTID:2218330362453615

Subject:Computer application technology

Abstract/Summary:

Cross Language Text Categorization (CLTC) is the task of assigning class labels to documents written in a target language (e.g. Chinese) when the system is trained on labeled examples in a source language (e.g. English). In this thesis, we study two key problems of CLTC.

The first problem is the language barrier between the source and target languages. To address it, we propose the Cross Language K-Nearest Neighbors (CLKNN) algorithm, which approaches CLTC from the perspective of Information Retrieval. The only external resource required by CLKNN is a bilingual dictionary. Experimental results show that our method gives promising performance and outperforms a translation-based method.

The second problem is the topic drift between languages, which causes a classifier trained on the source language to perform poorly on the target language. To address it, we propose an active learning algorithm for CLTC that makes use of both labeled data in the source language and unlabeled data in the target language. The classifier first learns classification knowledge from the source language and then learns culture-dependent knowledge from the target language. In addition, we extend the algorithm to a two-view form by treating the source and target languages as two views of the classification problem. Experiments show that our algorithm effectively improves cross-language classification performance.
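
The abstract does not give the details of CLKNN, so the following is only a minimal sketch of the general idea it describes: treat a target-language document as a query, map its terms into the source language with a bilingual dictionary, retrieve the most similar labeled source documents, and let them vote. The function names (`translate_terms`, `cosine`, `clknn_predict`), the raw term-frequency weighting, the similarity-weighted vote, and the toy dictionary and documents are all assumptions made for illustration, not the thesis's actual method or data.

```python
import math
from collections import Counter

def translate_terms(tokens, dictionary):
    """Map target-language tokens to source-language terms.

    Every dictionary translation is kept; tokens missing from the
    dictionary are dropped (an assumption of this sketch).
    """
    translated = []
    for token in tokens:
        translated.extend(dictionary.get(token, []))
    return translated

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def clknn_predict(target_tokens, labeled_source, dictionary, k=3):
    """Label a target-language document by a similarity-weighted vote
    among its k nearest labeled source-language documents."""
    query = Counter(translate_terms(target_tokens, dictionary))
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in labeled_source),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical toy data: a tiny Chinese-to-English dictionary and
# four labeled English documents, already tokenized.
dictionary = {
    "足球": ["football", "soccer"],
    "比赛": ["match", "game"],
    "股票": ["stock", "share"],
    "市场": ["market"],
}
labeled_source = [
    (["football", "match", "goal"], "sports"),
    (["soccer", "game", "team"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["share", "market", "bank"], "finance"),
]

sports_label = clknn_predict(["足球", "比赛"], labeled_source, dictionary)
finance_label = clknn_predict(["股票", "市场"], labeled_source, dictionary)
```

In this toy setup the first query is labeled `sports` and the second `finance`. Keeping every dictionary translation of a term avoids a word-sense disambiguation step at the cost of some noise in the query; a real system would also need target-language word segmentation and a stronger term weighting such as TF-IDF.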

Keywords/Search Tags:

text categorization, cross language text categorization, information retrieval, active learning

Related items

1	A Study On Text Categorization Based On Machine Learning
2	The Research On Cross Language Text Categorization Based On Interlingua Semantic
3	Fast Text Categorization Research
4	The Research On Several Key Techniques In Text Information Processing
5	Research On Web Information Retrieval Technology Based On Text Categorization
6	The Research And Implementation Of Chinese Text Categorization
7	Research On Bilingual Topic Model And Its Algorithm In Cross-language Information Retrieval
8	Research On Text Categorization Technology Based On Deep Learning
9	Study Of Text Categorization And Image Restoration In Modern Information Retrieval
10	Research And Implementation Of Text Categorization System Based On VSM