
Research On Heterogeneous Machine Learning For Cross-Domain Document Classification

Posted on: 2014-01-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q Tan    Full Text: PDF
GTID: 1268330425476682    Subject: Computer application technology
Abstract/Summary:
Traditional machine learning approaches make the basic assumption that the training and test data are drawn from the same feature space with an identical distribution. In many real-world applications, however, this independent and identically distributed (i.i.d.) assumption does not hold. On the other hand, abundant labeled data may exist in some related domains, while the target-domain data are commonly drawn from a different feature space and follow a different distribution from that of the source domain. Transfer learning models data that come from related but not identically distributed sources, and it plays an important role in machine learning and data mining. When done successfully, knowledge transfer greatly improves learning performance by avoiding tremendously expensive data annotation effort. Transfer learning is widely applied in areas such as text mining, sentiment classification, collaborative filtering, computer vision, and web search ranking. Beyond domain heterogeneity, real-world machine learning applications exhibit various other types of data heterogeneity, such as view heterogeneity and the heterogeneity of online knowledge bases.

This thesis focuses on heterogeneous machine learning. To tackle different kinds of data heterogeneity, we aim to mine the hidden common knowledge between heterogeneous data from different domains (tasks, views, etc.), which bridges the domain gap and helps knowledge transfer. For the domain heterogeneity problem, we propose a novel instance weighting approach that identifies important source instances and reuses them to build the target model. For the dual-heterogeneity problem (i.e., domain heterogeneity plus view heterogeneity), we propose a novel multi-view transfer learning framework that bridges the domain gap and enhances view consistency simultaneously.
At the same time, for the specific cross-domain document classification problem, we also propose a new topic model that combines content information and link structure in a unified probabilistic framework. For the heterogeneity of online knowledge bases, we introduce a new approach that builds an auxiliary link network and mines the hidden co-citation relationships from that network, which helps alleviate the data sparsity problem and bridge the domain gap. Specifically, the main research achievements are as follows:

1. Transfer learning based on kernel mean matching with a large margin

Various instance weighting methods have been proposed for instance-based transfer learning. Kernel Mean Matching (KMM) is a typical instance weighting approach that estimates instance importance by matching the source and target distributions in a universal reproducing kernel Hilbert space (RKHS). However, KMM is an unsupervised approach that does not utilize the class labels of the source data. In this thesis, we extend KMM by leveraging the class label knowledge, integrating KMM and SVM into a unified optimization framework called KMM-LM (Large Margin). The objective of KMM-LM is to maximize the geometric soft margin while simultaneously minimizing the empirical classification error together with the KMM-based domain discrepancy. KMM-LM uses an iterative minimization algorithm to find the optimal weight vector of the classification decision hyperplane and the importance weight vector of the source-domain instances. Experiments show that KMM-LM outperforms state-of-the-art baselines.

2. Knowledge transfer across different domain data with multiple views

In many real-world data mining applications, the distribution of the test data differs from that of the training data. At the same time, data are often represented by multiple views that are important to learning, yet little work has addressed both issues together.
We argue that measuring the discrepancy across different domains and enhancing consistency among multiple views are two key issues for the success of multi-view transfer learning. In this thesis, we explore leveraging multi-view information across different domains for knowledge transfer. We propose a novel transfer learning model, DV2S, which integrates the domain distance and view consistency into a 2-view support vector machine framework. The objective of DV2S is to find the optimal feature mapping such that, under the projections, the classification margin is maximized while both the domain distance and the disagreement between the views are minimized simultaneously. Experiments show that DV2S outperforms a variety of state-of-the-art algorithms.

3. Transfer learning based on an auxiliary interlink network with an online knowledge base

How to leverage the large amount of online knowledge available to improve learning performance is an important and novel problem. We survey transfer learning algorithms based on online knowledge sources such as online encyclopedias, social networks, and user-generated social media, focusing on why and how online knowledge can improve transfer performance across domains. Furthermore, we propose a novel approach that constructs an auxiliary link network from the background knowledge and uses it to discover direct and indirect co-citation relationships among documents by embedding the background knowledge into a graph kernel. The mined co-citation relationships not only reduce data sparsity, enrich document representations, and expand the feature space, but are also leveraged to bridge the gap across different domains. Experiments on different types of datasets demonstrate the advantage of transfer learning based on an auxiliary interlink network with an online knowledge base.
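As a toy illustration of the co-citation idea behind the auxiliary link network in contribution 3 (not the thesis's actual graph-kernel construction, which builds on an online knowledge base): given an adjacency matrix A where A[i, j] = 1 means document i cites document j, the direct co-citation count of documents j and k is simply (AᵀA)[j, k], the number of documents citing both.

```python
import numpy as np

# Toy citation graph: A[i, j] = 1 iff document i cites document j.
# (Illustrative data only; the thesis derives its link network from
# background knowledge, which is not reproduced here.)
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
])

# Direct co-citation: C[j, k] counts the documents citing both j and k.
C = A.T @ A

# Documents 1 and 2 are both cited by document 0.
print(C[1, 2])  # -> 1
```

Indirect co-citation relationships (documents linked through longer chains) can be exposed by powers of this matrix; the thesis embeds such relationships into a graph kernel rather than using raw counts.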
4. A topic model based on multiple views for cross-domain document classification

Transfer learning uses labeled data available from a related (source) domain to achieve effective knowledge transfer to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationships among them. In this thesis, we propose a novel cross-domain document classification approach called the Topic model with Multiple Views (TMV). The model is based on the assumption that the documents of the source and target domains share some common topics from the viewpoint of both content information and link structure. By mapping the data of both domains into latent topic spaces, TMV encodes knowledge about domain commonality and difference as shared topics with associated differential probabilities. The key step of TMV is to combine the content information and link structure simultaneously into a unified latent topic model; the shared topics then act as a bridge to facilitate knowledge transfer from the source to the target domain. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.

In summary, we propose various heterogeneous machine learning approaches to tackle different data heterogeneity problems, and the experimental results demonstrate their effectiveness. It is worth noting that although this thesis focuses on cross-domain document classification, the proposed approaches are easily extended to other machine learning areas such as image classification, sentiment classification, collaborative filtering and recommendation, and web search ranking.
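For concreteness, the instance-weighting step that KMM-LM (contribution 1) builds on can be sketched as follows. This is a simplified, hypothetical stand-in for Kernel Mean Matching: the weights come from a ridge-regularized linear solve followed by clipping, not from the constrained quadratic program a full KMM implementation solves, and the RBF bandwidth and clipping bound are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Y.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kmm_weights(X_src, X_tgt, gamma=0.5, B=10.0, ridge=1e-2):
    # Match the kernel mean of the weighted source sample to the target
    # mean: solve (K + ridge*I) w = kappa, then clip w into [0, B].
    # A full KMM implementation would instead solve a quadratic program
    # with nonnegativity and normalization constraints.
    n_src, n_tgt = len(X_src), len(X_tgt)
    K = rbf_kernel(X_src, X_src, gamma)
    kappa = (n_src / n_tgt) * rbf_kernel(X_src, X_tgt, gamma).sum(axis=1)
    w = np.linalg.solve(K + ridge * np.eye(n_src), kappa)
    return np.clip(w, 0.0, B)

# Source drawn around 0, target shifted to 2: source instances lying
# near the target distribution should receive larger weights.
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(100, 1))
X_tgt = rng.normal(2.0, 1.0, size=(50, 1))
w = kmm_weights(X_src, X_tgt)
```

In KMM-LM these importance weights are not computed in isolation as above; they are optimized jointly with the SVM margin objective, alternating between the instance weights and the decision hyperplane.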
Keywords/Search Tags: transfer learning, text mining, large margin, multi-view learning, link network, topic model