Font Size: a A A

Mining a Shared Concept Space for Domain Adaptation in Text Mining

Posted on:2012-10-06Degree:Ph.DType:Thesis
University:The Chinese University of Hong Kong (Hong Kong)Candidate:Chen, BoFull Text:PDF
GTID:2468390011464076Subject:Computer Science
Abstract/Summary:
In many text mining applications involving high-dimensional feature space, it is difficult to collect sufficient training data for different domains. One strategy to tackle this problem is to intelligently adapt the trained model from one domain with labeled data to another domain with only unlabeled data. This strategy is known as domain adaptation. However, there are two major limitations of the existing domain adaptation approaches. The first limitation is that they all separate the domain adaptation framework into two separate steps. The first step attempts to minimize the domain gap, and then the second step is to train the predictive model based on the reweighted instances or transformed feature representation. However, such a transformed representation may encode less information affecting the predictive performance. The second limitation is that they are restricted to using the first-order statistics in a Reproducing Kernel Hilbert Space (RKHS) to measure the distribution difference between the source domain and the target domain. In this thesis, we focus on developing solutions for those two limitations hindering the progress of domain adaptation techniques.;We develop a novel model to learn a low-rank shared concept space with respect to two criteria simultaneously: the empirical loss in the source domain, and the embedded distribution gap between the source domain and the target domain. Besides, we can transfer the predictive power from the extracted common features to the characteristic features in the target domain by the feature graph Laplacian. Moreover, we can kernelize our proposed method in the Reproducing Kernel Hilbert Space (RKHS) so as to generalize our model by making use of the powerful kernel functions. We theoretically analyze the expected error evaluated by common convex loss functions in the target domain under the empirical risk minimization framework, showing that the error bound can be controlled by the expected loss in the source domain, and the embedded distribution gap.;Then we propose an improved symmetric Stein's loss (SSL) function which combines the mean and covariance discrepancy into a unified Bregman matrix divergence of which Jensen-Shannon divergence between normal distributions is a particular case. Based on our proposed distribution gap measure based on second-order statistics, we present another new domain adaptation method called Location and Scatter Matching. The target is to find a good feature representation which can reduce the embedded distribution gap measured by SSL between the source domain and the target domain, at the same time, ensure the new derived representation can encode sufficient discriminants with respect to the label information. Then a standard machine learning algorithm, such as Support Vector Machine (SYM), can be adapted to train classifiers in the new feature subspace across domains.;We conduct a series of experiments on real-world datasets to demonstrate the performance of our proposed approaches comparing with other competitive methods. The results show significant improvement over existing domain adaptation approaches.
Keywords/Search Tags:Domain, Space, Embedded distribution gap, Feature
Related items