
Novel Techniques for Improving Classification Systems by Incorporating Experts

Posted on: 2014-02-17
Degree: Ph.D.
Type: Dissertation
University: Polytechnic Institute of New York University
Candidate: Attenberg, Joshua M.
Full Text: PDF
GTID: 1458390005992573
Subject: Computer Science
Abstract/Summary:
This manuscript presents novel techniques for incorporating the domain knowledge and wisdom of human "oracles" into the data mining workflow. Tasked with building predictive models for a real-world, web-scale prediction task, we quickly realized that many data mining techniques, including state-of-the-art research, fail to perform as advertised: assumptions that can be made in the lab often do not hold in reality. To overcome these difficulties, we needed to employ human effort in clever ways, addressing unexpected deficiencies in collecting data for model training, performing predictions, and evaluating the quality of a model's predictions.

Leveraging human knowledge for data mining or machine learning tasks is by no means new. Typically, constructing and monitoring a predictive machine learning system requires labeled example data. While some settings yield labels naturally, in others human effort must be employed to "manually" examine each instance under consideration and apply an appropriate label. These labeled instances are most often used during or prior to the training phase of the data mining process, supplying the data considered during model induction. Gathering labels for selected examples, however, is not the only way human effort can improve a data mining system. Humans can actively seek out examples they believe will be useful for a model's training. Labeled examples can also be gathered for a model deployed in production, generating performance estimates and building a better understanding of how the model behaves. Finally, human labels can substitute for a model's imperfect predictions, applying human expertise at inference time.

In the following research, we identify several deficiencies in existing techniques for gathering training data for data mining systems and offer alternatives that we demonstrate to be far more effective. We also expose problems in traditional model evaluation, problems that are particularly acute in web-scale prediction tasks, and provide an alternative approach that uses a gamified design to aid the task of evaluating a model. Finally, we present a novel setting for applying human resources to predictive inference, give a utility-optimizing approach, and demonstrate that this approach is, in fact, also a good way of gathering additional training data for model improvement.

The techniques presented here are proven not only through simulation in a laboratory setting but in reality: these ideas were forged from the demands of production. Their use in a production system validated them far beyond what is typical for machine learning research. Still, to demonstrate that the ideas discussed here generalize to a variety of tasks, we support our claims with a variety of simulations.
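The abstract does not specify how the utility-optimizing allocation of human effort at inference time works. As an illustration only, the sketch below shows one common way such a decision rule can be framed: route an instance to a human oracle whenever the expected cost of acting on the model's prediction exceeds the cost of obtaining a human label. All function names and cost parameters here are hypothetical, not taken from the dissertation.

```python
# Illustrative sketch (not the dissertation's actual method): decide at
# inference time whether to act on a model's prediction or pay for a
# human label, by comparing expected misclassification cost against
# the cost of human labeling.

def expected_error_cost(p_positive: float, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of acting on the model's prediction for one instance.

    If we predict the more likely class, the expected cost is the
    probability of the other class times the cost of that mistake.
    """
    if p_positive >= 0.5:
        # Predict positive; the risk is a false positive.
        return (1.0 - p_positive) * cost_fp
    # Predict negative; the risk is a false negative.
    return p_positive * cost_fn


def route_to_human(p_positive: float, cost_fp: float, cost_fn: float,
                   human_label_cost: float) -> bool:
    """Send the instance to a human oracle when that lowers expected cost."""
    return expected_error_cost(p_positive, cost_fp, cost_fn) > human_label_cost


if __name__ == "__main__":
    # A model only 60% confident on a decision with an expensive false
    # positive gets escalated: expected cost 0.4 * 10.0 = 4.0 > 1.0.
    print(route_to_human(p_positive=0.6, cost_fp=10.0, cost_fn=2.0,
                         human_label_cost=1.0))  # True
```

Note that under such a rule, every instance routed to a human also yields a labeled example, which is consistent with the abstract's observation that inference-time human labeling doubles as a way of gathering additional training data.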
Keywords/Search Tags: Techniques, Data mining, Novel, Human, System, Model