Font Size: a A A

False negative estimation: Theory, techniques and applications

Posted on:2009-03-27Degree:Ph.DType:Thesis
University:University of MinnesotaCandidate:Mane, Sandeep VijayFull Text:PDF
GTID:2448390002996000Subject:Computer Science
Abstract/Summary:
Model performance evaluation is an important task in data mining. Consider the problem of network intrusion detection, where millions of instances ("network connections") are analyzed from real-time data and classified into True ("network intrusion") and False ("normal connection") by a model. Unbalanced class distribution causes the supervised (or semi-supervised) models used for such datasets to classify relatively fewer (∼1100) instances as True, while the remaining majority instances are classified as False. The instances predicted True, being small in number, are analyzed further (e.g., manual labeling by an expert) to separate true positives (correctly classified True instances) from false positives (misclassified False instances). However, the instances predicted False cannot be further analyzed due to their high number (and consequently high labeling costs). Domain characteristics such as high data volume, need for real-time analysis, high labeling costs and unbalanced class distribution thus make it difficult to determine the false negatives (misclassified True instances) predicted by the model. This is a widely occurring problem also found in applications such as fraud detection, and multi-stage decision process. The false negatives in such applications usually have a high misclassification cost, and an estimate of the number of false negatives is required for evaluating the model's performance, e.g., using recall or ROC curve. This problem of estimating the number of false negatives predicted by a supervised (or semi-supervised) learning model, when it is difficult to obtain the actual class label of predicted negatives, is called the "false negative estimation problem".; This thesis addresses the important problem of estimating the number of false negatives predicted by a model. Initially, direct approaches based on sampling were used to estimate the false negatives, namely simple random sampling and inverse sampling. However, the relatively high manual labeling costs for these sampling approaches limits their practical usage. An indirect false negative estimation approach, based on "capture-recapture method ", having low manual labeling requirements, is presented. The intuitive idea is to estimate the number of True instances in the data and to use it to estimate the number of false negatives. To estimate True instances, multiple samples of the data (using separate models) are collected; the number of True instances in each sample is determined, and cross-tabulated into a contingency table. The cell frequencies in this table are modeled using contingency table analysis techniques, and an estimate is made for the number of True instances not detected in all samples. The choice of multiple models for sampling is critical for a good estimate of True instances, as the estimate's accuracy is improved by reducing the dependency between the samples. Making a theoretical analysis of the problem of training multiple models, this thesis proposes novel, information-theoretic greedy algorithms that select features for such models. Case studies in network intrusion detection and sales opportunity mining illustrate that the capture- recapture method with the feature selection algorithms provides a good estimate of false negatives with relatively low labeling costs.; A further extension of the problem of designing the samplings for capture-recapture method is studied in peer-to-peer networks. The use of random walks for independent sampling of nodes in the network is analyzed and then a capture-recapture method for estimating network size is proposed. Simulation study has shown that such an approach provides good estimates with relatively lesser communication overheads.; This lays the foundation of future research and applications of indirect estimation methods for false negatives. Many real-world applications have cost-sensitive, multi- class datasets; extending the proposed methods to such datasets remains an open probl...
Keywords/Search Tags:False, Applications, Data, Negatives, True instances, Problem, Network intrusion, Labeling costs
Related items