Font Size: a A A

Sparse sequence modeling with applications to computational biology and intrusion detection

Posted on:2003-08-10Degree:Ph.DType:Dissertation
University:Columbia UniversityCandidate:Eskin, EleazarFull Text:PDF
GTID:1468390011985626Subject:Computer Science
Abstract/Summary:
Sequence models have been studied for some time in different contexts including language parsing and analysis, genomics, and recently in computer security in the area of intrusion detection. Many of these sequences can be characterized as “sparse”, that is only a fraction of the elements of the sequence have meaningful value. This is the case in many practical applications, such as the analysis of DNA sequences, where it is postulated that only about 1–3% of the sequence has any biological significance. Similarly, in intrusion detection, typically the evidence that an audit stream from a system contains an attack is often buried in a vast amount of irrelevant information. Modeling sparse sequences often requires allowing “softer” matches between a sequence and a canonical model such as allowing for mismatches. For example, the classical DNA signal “TATAAT” can often occur with several mismatches in any position such as occurrences “TATCAT” or “TAAAAT”. Computationally, this is problematic because there is an exponential number of models which can match a given sequence. Thus naive approaches to sparse sequence modeling are computationally complex in both time and space.; We present a new efficient framework for approaching sparse sequence modeling problems. We present techniques using this framework to address three computational problems: classification or transduction, outlier detection, and signal finding. Specifically, we demonstrate this framework with three applications, classification of amino acid sequences into protein families, outlier detection over sequences of system calls for intrusion detection, and signal finding for discovering transcription factor binding sites in gene promoter regions. This framework employs efficient data structures which index the sequences to allow iterating over all possible sparse models of the sequence. We modify several learning algorithms including Boosting, Support Vector Machines, and a new set of outlier detection algorithms to take advantage of these data structures. While still considering as rich a set of models as the naive approaches we can avoid the intractable time and space requirements.
Keywords/Search Tags:Sequence, Intrusion detection, Models, Time, Applications
Related items