Sparse sequence modeling with applications to computational biology and intrusion detection

Posted on:2003-08-10

Degree:Ph.D

Type:Dissertation

University:Columbia University

Candidate:Eskin, Eleazar

Full Text:PDF

GTID:1468390011985626

Subject:Computer Science

Abstract/Summary:

Sequence models have been studied for some time in different contexts including language parsing and analysis, genomics, and recently in computer security in the area of intrusion detection. Many of these sequences can be characterized as “sparse”, that is only a fraction of the elements of the sequence have meaningful value. This is the case in many practical applications, such as the analysis of DNA sequences, where it is postulated that only about 1–3% of the sequence has any biological significance. Similarly, in intrusion detection, typically the evidence that an audit stream from a system contains an attack is often buried in a vast amount of irrelevant information. Modeling sparse sequences often requires allowing “softer” matches between a sequence and a canonical model such as allowing for mismatches. For example, the classical DNA signal “TATAAT” can often occur with several mismatches in any position such as occurrences “TATCAT” or “TAAAAT”. Computationally, this is problematic because there is an exponential number of models which can match a given sequence. Thus naive approaches to sparse sequence modeling are computationally complex in both time and space.; We present a new efficient framework for approaching sparse sequence modeling problems. We present techniques using this framework to address three computational problems: classification or transduction, outlier detection, and signal finding. Specifically, we demonstrate this framework with three applications, classification of amino acid sequences into protein families, outlier detection over sequences of system calls for intrusion detection, and signal finding for discovering transcription factor binding sites in gene promoter regions. This framework employs efficient data structures which index the sequences to allow iterating over all possible sparse models of the sequence. We modify several learning algorithms including Boosting, Support Vector Machines, and a new set of outlier detection algorithms to take advantage of these data structures. While still considering as rich a set of models as the naive approaches we can avoid the intractable time and space requirements.

Keywords/Search Tags:

Sequence, Intrusion detection, Models, Time, Applications

Related items

1	Algorithms Of Sequence Analysis And Theirs Applications In Intrusion Tolerance
2	Intrusion Detection Model Based On Time Series Study
3	Intrusion Detection Model Of Host System Call Sequence Based On Neighbor Algorithm
4	Research On Sequence Perceptible Industry Control System Intrusion Detection Technology Based On Machine Learning
5	Research And Implementation Of The New Multi-dimensionality Sequence Mining Arithmetic For Intrusion Detection System
6	Host-based Intrusion Detection And Prediction Algorithms
7	A Multi-step Intrusion Detection Algorithm Based On The Temporal Sequence Association
8	Study On Sequence Patterns Mining And Its Application In Intrusion Detection
9	An Approach For Database Intrusion Detection Based On The Event Sequence Clustering
10	Firewalls And Intrusion Detection System In Campus Network Combined Preliminary Applications