A theory of multitask learning for learning from disparate data sources

Posted on: 2004-10-18
Degree: Ph.D.
Type: Dissertation
University: Cornell University
Candidate: Schuller, Rebecca Ann
Full Text: PDF
GTID: 1468390011962651
Subject: Computer Science
Abstract/Summary:
Many endeavors require the integration of data from multiple data sources. One major obstacle to such undertakings is that different sources may vary considerably in the way they choose to represent their data, even if their data collections are otherwise perfectly compatible. In practice, this problem is usually solved by manually constructing translations between these data representations, although there have been some recent attempts at supplementing this with automated algorithms based on machine learning methods.

This work addresses the problem of making classification predictions based on data from multiple sources, without constructing explicit translations between them. We view this problem as a special case of multitask learning: both intuition and much empirical work indicate that learning can be improved by attacking multiple related tasks simultaneously. However, thus far, no theoretical work has been able to support this claim, and no concrete definition has been proposed for what it means for two learning tasks to be “related.”

In this work, we introduce a general notion of relatedness between tasks, provide the standard sort of information complexity bound for such tasks, and give general conditions under which this bound is an improvement over standard single-task learning results.

Finally, we apply these results to the problem of learning from disparate data sources. We give a decision tree learning algorithm for this problem for a particular type of data source disparity and demonstrate its empirical success on real data sets.
Keywords/Search Tags: Data, Sources, Problem