Research On Cross-Domain Text Classification

Posted on: 2013-05-23  Degree: Master  Type: Thesis
Country: China  Candidate: L H Li
GTID: 2248330392458483  Subject: Software engineering
Abstract/Summary:
Text classification algorithms have proven effective at automatically organizing text data. In practice, however, it is usually expensive to obtain enough labeled documents to train a precise classifier for a given domain, whereas plenty of labeled documents are often available in a related but different domain. It would therefore be desirable to leverage the labeled documents from the related domain to train a precise classifier for the target domain. However, since different domains usually differ in their underlying distributions, traditional classification algorithms are challenged when the training data and the test data come from different distributions.

Recently, cross-domain text classification has been proposed to solve this problem. Cross-domain text classification aims to automatically train a precise text classifier for a target domain by using labeled text data from a related source domain. In this thesis, I first discuss our proposed Topic Correlation Analysis (TCA) approach for cross-domain text classification. In TCA, all word features are first grouped into shared topics and domain-specific topics using a joint mixture model. The correlations between the two kinds of topics are then inferred and used to induce a mapping between the domain-specific topics for domain adaptation. Experimental results on two standard text data sets demonstrate the superiority of the proposed method over state-of-the-art cross-domain classification methods.

Next, I discuss our proposed solution for multi-domain active learning in text classification. The proposed query strategy aims to choose the unlabeled instances that maximally reduce the model loss of the classifiers on all domains. Experimental results on three real-world applications demonstrate that the proposed method can save more than 30% of the labeling effort compared with state-of-the-art active learning methods.
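
The abstract does not give the query strategy in detail, so the following is only a minimal sketch of the general idea of scoring unlabeled instances jointly across domains. It swaps in plain uncertainty sampling (summed predictive entropy) as a stand-in for the thesis's loss-reduction criterion; the names `entropy`, `select_queries`, and `pool_probs`, as well as the toy probabilities, are hypothetical and not taken from the thesis.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs, k):
    """Pick the k unlabeled instances with the highest total uncertainty.

    pool_probs[i][d] is the class distribution that a (hypothetical)
    classifier for domain d predicts for unlabeled instance i.
    ASSUMPTION: summed entropy is a simple proxy, not the thesis's
    actual model-loss-reduction criterion.
    """
    scores = [sum(entropy(p) for p in per_domain) for per_domain in pool_probs]
    # Rank instance indices from most to least uncertain across all domains.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool: 3 unlabeled instances, 2 domains, binary classification.
pool = [
    [[0.90, 0.10], [0.80, 0.20]],  # fairly confident in both domains
    [[0.50, 0.50], [0.60, 0.40]],  # uncertain in both domains
    [[0.99, 0.01], [0.50, 0.50]],  # uncertain in only one domain
]
chosen = select_queries(pool, 1)
print(chosen)
```

In this toy pool the second instance is selected, because it is the one on which both domain classifiers are uncertain; labeling it is expected to help every domain at once, which is the intuition behind a shared query budget.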
Keywords/Search Tags: Text Classification, Cross-Domain Learning, Data Mining