The safe use of synthetic data in classification

Posted on: 2009-12-24

Degree: Ph.D.

Type: Dissertation

University: Lehigh University

Candidate: Nonnemaker, Jean E.

GTID:1448390005452904

Subject:Computer Science

Abstract/Summary:

When is it safe to use synthetic data in supervised classification? In supervised classification, classifiers are designed fully automatically by learning from a file of labeled training samples. Trainable classifier technologies therefore require large, representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. This immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representative of the real distributions?" Other questions include "When will training on synthetic data work as well as, or better than, training on real data?"

We distinguish between sample space (the set of all real samples), parameter or generator space (samples that can be generated synthetically), and feature space (samples described by numerical feature values). Synthetic data can be produced in what we call parameter space by varying the parameters that control their generation. We are interested in exploring how generator space and feature space relate to one another. Specifically, we have explored the feasibility of varying the generating parameters for typefaces in Knuth's Metafont system to see whether previously unseen fonts could also be recognized.

More generally, we have attempted to formalize a reliable methodology for the generation and use of synthetic data in supervised classification. We have designed and systematically carried out a family of experiments in which pure typefaces already in wide use are supplemented with synthetically generated typefaces interpolated in generator (parameter) space in the Metafont system. We also vary image quality widely using a parameterized image defect generator. (Abstract shortened by UMI.)...
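
The idea of interpolating typefaces in generator (parameter) space can be illustrated with a minimal sketch. It is not taken from the dissertation: the parameter names (`stem_width`, `serif_length`, `slant`), their values, and the helper functions are hypothetical and do not correspond to actual Metafont parameters. The sketch only assumes that each typeface is described by the same set of numeric generating parameters and that intermediate typefaces are obtained by linear interpolation between two of them.

```python
from typing import Dict, List

# Hypothetical generating parameters for two "pure" typefaces.
# Names and values are illustrative only, not real Metafont settings.
FACE_A: Dict[str, float] = {"stem_width": 20.0, "serif_length": 0.0, "slant": 0.0}
FACE_B: Dict[str, float] = {"stem_width": 30.0, "serif_length": 10.0, "slant": 0.2}


def interpolate(a: Dict[str, float], b: Dict[str, float], t: float) -> Dict[str, float]:
    """Linearly interpolate every parameter: t=0 yields a, t=1 yields b."""
    if a.keys() != b.keys():
        raise ValueError("both typefaces must define the same parameters")
    return {name: (1.0 - t) * a[name] + t * b[name] for name in a}


def synthetic_faces(a: Dict[str, float], b: Dict[str, float], steps: int) -> List[Dict[str, float]]:
    """Return `steps` evenly spaced parameter sets strictly between a and b."""
    return [interpolate(a, b, i / (steps + 1)) for i in range(1, steps + 1)]


if __name__ == "__main__":
    # Each parameter set would be handed to a font generator to render
    # training images; here we only print the interpolated parameters.
    for params in synthetic_faces(FACE_A, FACE_B, steps=3):
        print(params)
```

In such a setup, each interpolated parameter set would be rendered into glyph images and added to the training set alongside the images of the pure typefaces; whether doing so is safe is the question the dissertation investigates.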

Keywords/Search Tags:

Synthetic data, Classification, Training sets, Generator

Related items

1	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
2	Research On Data Stream Classification Based On Granular Computing And F-Rough Sets Extension
3	Research Of Improvement To The Density-based Method For Reducing The Amount Of Training Data And Application To KNN
4	Fault Diagnosis System Of Hydroelectric Generator Sets And Study Of Data Mining Technology
5	Research And Applications Of Classification Algorithms In Imbalanced Data Sets
6	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
7	Optimal subsequence bijection and classification of imbalanced data sets
8	An Adaptive Sampling Ensemble Classifier For Learning From Imbalanced Data Sets
9	Text Classification Algorithm Based On Imbalanced Data Sets
10	Research On The Classification Of Imbalanced Data Sets And Related Problems