
Cost-sensitive, scalable and adaptive learning using ensemble-based methods

Posted on: 2002-03-15
Degree: Ph.D.
Type: Thesis
University: Columbia University
Candidate: Fan, Wei
Full Text: PDF
GTID: 2468390011491822
Subject: Computer Science
Abstract/Summary:
This thesis focuses on the problems of cost-sensitivity, scalability, and adaptivity as they pertain to the fields of data mining and inductive learning.

Knowledge Discovery in Databases and Data Mining (KDD) concerns the theory and practice of automatically learning models from large storehouses of data, so that new knowledge can be learned effectively from known facts. Inductive learning systems have grown in importance over the past decade as the field of KDD has matured and gained prominence both scientifically and commercially.

A number of simplifying assumptions have been made in prior work: (1) All data resides on a single processor and fits entirely in main memory. (2) Each datum in a learning set is considered equally important, and thus uniform costs are assumed. (3) All features in a dataset are freely acquired, with no computational or monetary cost. This is unrealistic for many applications, such as medical diagnosis and intrusion detection. (4) A model is computed on the basis of complete knowledge: a learned hypothesis will be applied to scenarios that are completely represented in the training set.

We concentrate on the problems of cost-sensitivity, scalability, and adaptivity. In cost-sensitive learning, exemplar costs and feature testing costs are considered in both model construction and evaluation. In scalable learning, the focus is on the ability to learn from a dataset that is either much bigger than the main memory of the processor or distributed across a network of computers. A related issue is the ability to reuse existing classifiers for new applications, thus saving the time and resources needed to train a completely new model.

We are interested in general, algorithm-independent solutions to these problems. Our approaches work with different inductive algorithms.
We have chosen to apply ensembles of classifiers, or multiple classifiers, to these tasks, developing the following techniques: a misclassification cost-sensitive boosting algorithm; an operational cost-sensitive ensemble approach; a scalable, distributed, and on-line extension to boosting; an artificial anomaly generation algorithm; anomaly detection, and combined misuse and anomaly detection, using artificial anomalies; and an adaptive combined misuse and anomaly detection ensemble approach.

These techniques are new and have not been reported before. The effectiveness and advantages of our algorithms have been shown in empirical evaluations on credit card fraud detection and intrusion detection datasets. A few of the algorithms are being implemented in an intrusion detection system prototype. (Abstract shortened by UMI.)...
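
The idea of combining an ensemble with non-uniform misclassification costs can be illustrated with a small sketch. It is not the thesis's algorithm: it simply averages the class probabilities of several hypothetical classifiers and then picks the class with the lowest expected cost, a standard decision rule assumed here for illustration. The function names, the cost matrix `COST`, and the probability values are all made up for the example.

```python
from typing import List, Sequence

# Hypothetical cost matrix for a fraud-detection setting (values are assumptions).
# COST[i][j] = cost of predicting class i when the true class is j.
# Class 0 = legitimate, class 1 = fraud.
COST = [
    [0.0, 100.0],  # predict legitimate: free if correct, 100 if fraud is missed
    [5.0, 0.0],    # predict fraud: 5 for a false alarm, free if correct
]


def average_probabilities(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors produced by the ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def min_cost_class(probs: Sequence[float], cost: Sequence[Sequence[float]]) -> int:
    """Return the class whose expected misclassification cost is lowest."""
    expected = [
        sum(cost[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(cost))
    ]
    return min(range(len(expected)), key=expected.__getitem__)


def total_cost(predictions: Sequence[int], labels: Sequence[int],
               cost: Sequence[Sequence[float]]) -> float:
    """Evaluate predictions by summed cost rather than by error rate."""
    return sum(cost[p][y] for p, y in zip(predictions, labels))


# Three hypothetical ensemble members score one transaction.
members = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]]
avg = average_probabilities(members)
decision = min_cost_class(avg, COST)
```

With these assumed numbers, the averaged fraud probability is low, yet the cost-sensitive rule still flags the transaction, because a missed fraud is priced far above a false alarm. A plain most-probable-class rule would label it legitimate.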
Keywords/Search Tags: Data, Intrusion detection, Cost-sensitive, Scalable, Learn