
Active acquisition of informative training data

Posted on: 2011-02-14
Degree: Ph.D.
Type: Thesis
University: Tufts University
Candidate: Lomasky, Rachel
Full Text: PDF
GTID: 2448390002451673
Subject: Computer Science
Abstract/Summary:
The performance of a classifier built from labeled training data is highly dependent on the quality of the data. In many domains, collecting high-quality training data can be labor-intensive and expensive. To address this problem, we must ensure that the examples acquired are informative. Ideally, one would gather a training data set containing only relevant, non-redundant examples, and would acquire this data efficiently, with minimal effort and resources. The time of the human aiding in data generation is precious, and we seek to use it wisely. By considering class proportions, this thesis makes three contributions to optimizing the use of human assistance in training data creation for computer-based classifiers. First, we identify a new class of supervised learning problems in which the process of generating data cannot be separated from the process of obtaining labels. This class of problems, which we call Active Class Selection (ACS), addresses the question: if one can collect n additional training instances, how should they be distributed with respect to class? The second and third contributions improve training data collection for a previously identified problem, Active Learning (AL). AL addresses a question distinct from, but related to, ACS: if one has n instances in an unlabeled pool U, which instances from U should one have a human label? We address this problem in two ways. First, we demonstrate how ideas from ACS can be used to perform AL on multiclass datasets. Second, we address a largely neglected problem in AL: when should one stop labeling data because further labels will not increase classifier performance? We also explore how to dynamically choose which AL method is best suited to a dataset at a given stage of AL.
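The two questions posed in the abstract can be made concrete with a minimal sketch. The code below is illustrative only and is not taken from the thesis: the function names, the "more instances for less accurate classes" allocation rule, and the least-confidence selection rule are all assumptions chosen to show the shape of each problem. For ACS, the input is a budget n and some per-class accuracy estimate; for AL, the input is a pool U of predicted class probabilities.

```python
def allocate_by_inverse_accuracy(n, class_accuracy):
    """ACS-style sketch (hypothetical heuristic): split a budget of n new
    instances across classes, giving more to classes with lower accuracy."""
    # Weight each class by its error rate; the floor avoids an all-zero total.
    weights = {c: max(1.0 - acc, 1e-9) for c, acc in class_accuracy.items()}
    total = sum(weights.values())
    raw = {c: n * w / total for c, w in weights.items()}
    counts = {c: int(raw[c]) for c in raw}
    # Hand out the instances lost to rounding, largest fractional part first.
    leftover = n - sum(counts.values())
    by_fraction = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_fraction[:leftover]:
        counts[c] += 1
    return counts


def select_least_confident(pool_probs, k):
    """AL-style sketch (hypothetical heuristic): return the indices of the k
    pool instances whose highest predicted class probability is lowest."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: max(pool_probs[i]))
    return ranked[:k]


if __name__ == "__main__":
    # Made-up accuracy estimates for three classes.
    print(allocate_by_inverse_accuracy(10, {"a": 0.9, "b": 0.6, "c": 0.5}))
    # Made-up predicted probabilities for a three-instance pool.
    print(select_least_confident([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], 2))
```

The difference between the two settings shows up in the signatures: the ACS sketch decides how many instances to generate per class before any instance exists, while the AL sketch ranks instances that already exist in the pool and only requests their labels.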
Keywords/Search Tags:Data, Class, Active