Font Size: a A A

Learning to rank with partially-labeled data

Posted on:2010-03-10Degree:Ph.DType:Thesis
University:University of WashingtonCandidate:Duh, Kevin KFull Text:PDF
GTID:2448390002486041Subject:Engineering
Abstract/Summary:
Ranking is a key problem in many applications. In web search, for instance, webpages are ranked such that the most relevant ones are presented to the user first. In machine translation, a set of hypothesized translations are ranked so that the correct one is chosen. Abstractly, the problem of ranking is to predict an ordering over a set of objects. Given the importance of ranking in many applications, "Learning to Rank" has risen as an active research area, crossing disciplines such as machine learning and information retrieval. The approach is to adapt machine learning techniques developed for classification and regression problems to problems with rank structure. However, so far the majority of research has focused on the supervised learning setting. Supervised learning assumes that the ranking algorithm is provided with labeled data indicating the rankings or permutations of objects. Such labels may be expensive to obtain in practice.;The goal of this dissertation is to investigate the problem of ranking in the framework of semi-supervised learning. Semi-supervised learning assumes that data is only partially labeled, i.e. for some sets of objects, labels are not available. This kind of framework seeks to exploit the potentially vast amount of cheap unlabeled data in order to improve upon supervised learning. While both supervised learning for ranking and semi-supervised learning for classification have become active research themes, the combination, semi-supervised learning for ranking, has been less examined. This thesis aims to fill the gap.;The contribution of this thesis is an examination of several ways to exploit unlabeled data in ranking. In particular, four assumptions commonly used in classification (Change of Representation, Covariate Shift, Low Density Separation, Manifold) are extended to the ranking setting. Their implementations are tested on six real-world datasets from Information Retrieval, Machine Translation, and Computational Biology. The algorithmic contributions of this work include (a) a Local/Transductive meta-algorithm, which allows one to plug in different unlabel data assumptions with relative ease, and (b) a kernel defined on lists, which allow one to extend methods which work with samples (i.e. classification, regression) to methods which work with lists of samples (i.e. ranking). We demonstrate that several assumptions about how unlabeled data helps in classification can be successfully applied to the ranking problem, showing improvements over the supervised baseline under different dataset-method combinations.
Keywords/Search Tags:Ranking, Data, Problem, Classification, Supervised
Related items