Sampling the top-k representative data for classification

Posted on: 2006-08-08
Degree: M.Sc
Type: Thesis
University: Simon Fraser University (Canada)
Candidate: Wang, Ping
Full Text: PDF
GTID: 2458390005997663
Subject: Computer Science
Abstract/Summary:
Building classification models from databases is an exciting area of data mining research. In many classification tasks, only a small set of labelled training data is given, and these data are not sufficient for good classification. We need to sample and label more data as training data to improve performance. However, labelling data is time-consuming and costly. The challenge is to select the most representative data for labelling effectively.

While most active learning methods for this problem follow the incremental query learning paradigm, in which the classifier is retrained after each newly labelled query, we present a distance-based method that samples the top-k representative data simultaneously and can be applied to any distance-based classifier. Redundancy reduction makes classifier retraining unnecessary and helps the method select examples that are more balanced with respect to the class distribution in the database. Experimental results on two data sets with two classifiers demonstrate the advantages of our method.
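
The abstract does not spell out the selection procedure, so the following is only a rough illustration of the general idea of choosing k representatives at once with redundancy reduction, not the thesis's actual algorithm. It assumes a simple greedy coverage heuristic: a point's score is the number of not-yet-represented points within a fixed Euclidean distance. The function name `select_top_k_representatives` and the `radius` parameter are hypothetical and introduced purely for this sketch.

```python
import math


def select_top_k_representatives(points, k, radius):
    """Greedily pick up to k indices of representative points.

    Assumption (not from the thesis): a point "represents" every point
    within `radius` of it, and redundancy is reduced by ignoring points
    that an earlier pick already represents. No classifier is trained.
    """
    n = len(points)
    # Precompute each point's neighbourhood (indices within `radius`, itself included).
    neighbours = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]
    covered = set()   # points already represented by some selected point
    selected = []     # indices chosen for labelling, in selection order
    for _ in range(min(k, n)):
        # Score each remaining candidate by how many uncovered points it represents.
        best = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: len(neighbours[i] - covered),
        )
        if not neighbours[best] - covered:
            break  # every point is already represented; stop early
        selected.append(best)
        covered |= neighbours[best]  # redundancy reduction step
    return selected
```

Because the scores depend only on distances between unlabelled points, all k picks can be made before any label is requested, which is the contrast with incremental query learning drawn in the abstract. Marking neighbours as covered also pushes later picks toward different regions of the data, which is one plausible way a selection could end up more balanced across classes.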
Keywords/Search Tags: Top-k representative data, Classification