
An Optimal Design-of-Experiment-Based Batch Sampling from Databases

Posted on: 2017-09-10
Degree: Ph.D
Type: Dissertation
University: Northwestern University
Candidate: Ouyang, Liwen
Full Text: PDF
GTID: 1468390011495426
Subject: Industrial Engineering
Abstract/Summary:
The prevalence of large databases offers many new challenges and opportunities. When a supervised learning model must be built but the responses are missing or unreliable, it is often infeasible to obtain response values for all data points. In such cases, one needs to design an efficient sample: a subset of points whose responses are obtained and then used to build a model that gives accurate predictions. Similar situations have received attention in the active learning community, whose goal is to achieve high accuracy while labelling response values for as few samples as possible. However, active learning methods are usually implemented in sequential mode and focus on the single objective of classification accuracy. In contrast, many applications require batch sample designs and have a variety of objectives, which may include classification accuracy or the variance of the estimated parameters. Such problems closely resemble design of experiments (DOE), yet there are some distinctions.

In this dissertation, we first present two applications where such DOE-based batch sampling can be very useful. One uses a DOE-based approach to select a validation sample for logistic regression with error-prone medical records. The other uses a DOE-based approach to design treatment and control groups in controlled trials. Through these two applications, we introduce the DOE-based batch sampling approach and develop an efficient heuristic algorithm for optimal sample selection. We demonstrate that DOE-based batch sampling can outperform random sampling in both prediction and model estimation. Moreover, the advantage of our DOE-based method over random sampling is more pronounced when the sample is smaller relative to the whole data set.

We further explore the nature of different sampling design criteria to provide insights and guidelines for future practitioners. We examine the sample configurations produced by these criteria, compare their performance, and investigate their robustness to changes in the assumed parameters and the underlying model. We show that DOE-based batch sampling methods usually outperform random sampling and the entropy method in terms of both prediction and model estimation performance. In addition, we give recommendations on how to use a designed sample for logistic regression when the parameters and the underlying model are unknown.
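To make the idea of DOE-based batch selection concrete, the following is a minimal sketch and not the dissertation's algorithm, which the abstract does not specify. It assumes a D-optimality criterion (maximizing the log-determinant of the logistic-regression Fisher information under assumed parameters) and a simple greedy search; the function name `greedy_d_optimal`, its arguments, and the `ridge` regularizer are all hypothetical choices made for illustration.

```python
import numpy as np

def greedy_d_optimal(X, beta, n_select, ridge=1e-4):
    """Greedily pick n_select rows of X to (approximately) maximize
    log det of the logistic-regression Fisher information evaluated
    at assumed parameters beta. Illustrative sketch only."""
    n, p = X.shape
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = prob * (1.0 - prob)              # Fisher weight of each candidate
    Z = X * np.sqrt(w)[:, None]          # information of subset S is Z_S^T Z_S
    M_inv = np.eye(p) / ridge            # inverse of the ridge-initialized information
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(n_select):
        # Matrix determinant lemma: det(M + z z^T) = det(M) * (1 + z^T M^-1 z),
        # so the best next point maximizes z^T M^-1 z.
        gain = np.einsum("ij,jk,ik->i", Z, M_inv, Z)
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse information.
        z = Z[i]
        Mz = M_inv @ z
        M_inv -= np.outer(Mz, Mz) / (1.0 + z @ Mz)
    return chosen
```

Here `X` would be the candidate pool of covariates from the database (including an intercept column if the model has one), and `beta` an initial guess of the parameters. Because the criterion depends on the assumed `beta`, the selected sample is only as good as that guess, which is the robustness question the abstract raises.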
Keywords/Search Tags:Batch sampling, Model, Sample