
The safe use of synthetic data in classification

Posted on: 2009-12-24
Degree: Ph.D
Type: Dissertation
University: Lehigh University
Candidate: Nonnemaker, Jean E
Full Text: PDF
GTID: 1448390005452904
Subject: Computer Science
Abstract/Summary:
When is it safe to use synthetic data in supervised classification? In supervised classification, classifiers are designed fully automatically by learning from a file of labeled training samples. Trainable classifier technologies therefore require large, representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. Of course, this immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representative of the real distributions?" Other questions include "When will training on synthetic data work as well as, or better than, training on real data?"

We distinguish between sample space (the set of all real samples), parameter or generator space (samples that can be generated synthetically), and finally, feature space (samples described by numerical feature values). Synthetic data can be produced in what we call parameter space by varying the parameters that control their generation. We are interested in exploring how generator and feature space relate to one another. Specifically, we have explored the feasibility of varying the generating parameters for typefaces in Knuth's Metafont system to see if previously unseen fonts could also be recognized.

Generally, we have attempted to formalize a reliable methodology for the generation and use of synthetic data in supervised classification. We have systematically designed and carried out a family of experiments in which pure typefaces already widely used are supplemented with synthetically generated typefaces interpolated in generator or parameter space in the Metafont system. We also vary image quality widely using a parameterized image defect generator. (Abstract shortened by UMI.)
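
The idea of interpolating typefaces in generator or parameter space can be illustrated with a minimal sketch. It assumes that each typeface is described by a dictionary of numeric generating parameters and that intermediate typefaces are obtained by plain linear interpolation; the abstract does not state the interpolation scheme actually used. The function names and the parameter names (`stem_width`, `x_height`, `slant`) are hypothetical and are not taken from the dissertation or from Metafont.

```python
from typing import Dict, List

# Hypothetical representation: a typeface is a mapping from
# generating-parameter names to numeric values.
Params = Dict[str, float]


def interpolate_parameters(a: Params, b: Params, t: float) -> Params:
    """Linearly interpolate between two parameter sets.

    t = 0.0 returns a's values, t = 1.0 returns b's values.
    """
    if a.keys() != b.keys():
        raise ValueError("both typefaces must define the same parameters")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return {name: (1.0 - t) * a[name] + t * b[name] for name in a}


def interpolated_family(a: Params, b: Params, steps: int) -> List[Params]:
    """Return `steps` evenly spaced parameter sets strictly between a and b."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        interpolate_parameters(a, b, i / (steps + 1))
        for i in range(1, steps + 1)
    ]


if __name__ == "__main__":
    # Illustrative values only; they do not describe real typefaces.
    face_a = {"stem_width": 20.0, "x_height": 150.0, "slant": 0.0}
    face_b = {"stem_width": 30.0, "x_height": 170.0, "slant": 0.25}
    for params in interpolated_family(face_a, face_b, steps=3):
        print(params)
```

Each interpolated parameter set would then be passed to the generator to render synthetic training images, which is outside the scope of this sketch.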
Keywords/Search Tags: Synthetic data, Classification, Training sets, Generator