False negative estimation: Theory, techniques and applications

Posted on:2009-03-27

Degree:Ph.D

Type:Thesis

University:University of Minnesota

Candidate:Mane, Sandeep Vijay

Full Text:PDF

GTID:2448390002996000

Subject:Computer Science

Abstract/Summary:

Model performance evaluation is an important task in data mining. Consider the problem of network intrusion detection, where millions of instances ("network connections") are analyzed from real-time data and classified into True ("network intrusion") and False ("normal connection") by a model. Unbalanced class distribution causes the supervised (or semi-supervised) models used for such datasets to classify relatively fewer (∼1100) instances as True, while the remaining majority instances are classified as False. The instances predicted True, being small in number, are analyzed further (e.g., manual labeling by an expert) to separate true positives (correctly classified True instances) from false positives (misclassified False instances). However, the instances predicted False cannot be further analyzed due to their high number (and consequently high labeling costs). Domain characteristics such as high data volume, need for real-time analysis, high labeling costs and unbalanced class distribution thus make it difficult to determine the false negatives (misclassified True instances) predicted by the model. This is a widely occurring problem also found in applications such as fraud detection, and multi-stage decision process. The false negatives in such applications usually have a high misclassification cost, and an estimate of the number of false negatives is required for evaluating the model's performance, e.g., using recall or ROC curve. This problem of estimating the number of false negatives predicted by a supervised (or semi-supervised) learning model, when it is difficult to obtain the actual class label of predicted negatives, is called the "false negative estimation problem".; This thesis addresses the important problem of estimating the number of false negatives predicted by a model. Initially, direct approaches based on sampling were used to estimate the false negatives, namely simple random sampling and inverse sampling. However, the relatively high manual labeling costs for these sampling approaches limits their practical usage. An indirect false negative estimation approach, based on "capture-recapture method ", having low manual labeling requirements, is presented. The intuitive idea is to estimate the number of True instances in the data and to use it to estimate the number of false negatives. To estimate True instances, multiple samples of the data (using separate models) are collected; the number of True instances in each sample is determined, and cross-tabulated into a contingency table. The cell frequencies in this table are modeled using contingency table analysis techniques, and an estimate is made for the number of True instances not detected in all samples. The choice of multiple models for sampling is critical for a good estimate of True instances, as the estimate's accuracy is improved by reducing the dependency between the samples. Making a theoretical analysis of the problem of training multiple models, this thesis proposes novel, information-theoretic greedy algorithms that select features for such models. Case studies in network intrusion detection and sales opportunity mining illustrate that the capture- recapture method with the feature selection algorithms provides a good estimate of false negatives with relatively low labeling costs.; A further extension of the problem of designing the samplings for capture-recapture method is studied in peer-to-peer networks. The use of random walks for independent sampling of nodes in the network is analyzed and then a capture-recapture method for estimating network size is proposed. Simulation study has shown that such an approach provides good estimates with relatively lesser communication overheads.; This lays the foundation of future research and applications of indirect estimation methods for false negatives. Many real-world applications have cost-sensitive, multi- class datasets; extending the proposed methods to such datasets remains an open probl...

Keywords/Search Tags:

False, Applications, Data, Negatives, True instances, Problem, Network intrusion, Labeling costs

Related items

1	Research On Network Topology Information Technology In The Application Of Network Intrusion Detection System
2	A Model For Building Attack Scenarios Based On Correlation Mechanism Of Intrusion Detection And Vulnerability Scanning
3	Research On Correlated-Analysis Intrusion Detection Technology
4	Study On Distinguishing True And False Documents In Qing Dynasty
5	Clonal Selection Method Oriented Network Intrusion Detection
6	Research On Key Technologies Of Multiple Heterogeneous IDSes' Collaboration On Intrusion Detection In Large Scale Network
7	Research And Implementation On Distinguishing True-False Of ID Card Based On Image Processing
8	Study On The Social Cost And Management Research Of Network False News
9	Research Of The Intrustion Detection Based On Correlation Techniques
10	Research On Source Code Security Test Model Based On Result Feedback