Hypothesis margin based weighting for feature selection using boosting: Theory, algorithms and applications

Posted on: 2014-06-18
Degree: Ph.D
Type: Thesis
University: Northeastern University
Candidate: Alshawabkeh, Malak
Full Text: PDF
GTID: 2458390005988044
Subject: Engineering
Abstract/Summary:
Feature selection (FS) is a preprocessing step aimed at identifying a small subset of highly predictive features out of a large set of raw input variables that are possibly irrelevant or redundant. It plays a fundamental role in the success of many learning tasks where high dimensionality poses a major challenge. In this thesis, we take an unusual approach, using boosting as an effective FS method by exploiting the training examples' mean margins. A weight criterion, termed Margin Fraction (MF), is assigned to each feature that contributes to the margin distribution combined in the final output produced by boosting. We argue that using the MF is more favorable for several reasons. First, boosting hypothesis margins have been used both for theoretical generalization bounds and as guidelines for algorithm design, and thus a natural goal is to find learners (features) that achieve a maximum margin. Second, current boosting-based feature selection methods measure the relative importance of features based on the Confidence Ratio (CR) of the learned base hypothesis. However, while a feature may have a large CR, it will not contribute to a good overall margin unless its "conditional" margin is also large.

The thesis consists of two main parts. In part one, we establish a rigorous theoretical and mathematical basis for the proposed weighting and selection methodology, and we describe how to extend this methodology to handle imbalanced data by defining a new weight metric, termed AUC Margin Fraction (AMF), that characterizes the quality of a set of features based on the maximized Area Under the ROC Curve (AUC) margin it induces during the process of learning with boosting. Based on this, we design two different embedded FS algorithms, SBS-MF and SBS-AMF. We then investigate the effectiveness of the proposed methods through extensive comparisons with other algorithms using real-world data.

In part two, we apply the proposed SBS-AMF method to design a real intrusion detection system (IDS) for virtual server environments, utilizing only information available from the perspective of the virtual machine monitor (VMM). VMM-based IDSs break the boundaries of current state-of-the-art IDSs. They represent a new point in the IDS design space that trades a lack of program semantics for greater malware resistance and ease of deployment. To test the effectiveness and robustness of our proposed VMM IDS, we use different classes of servers, virtual appliances, and workloads, as well as different classes of malware. Our experimental results show that SBS-AMF achieves significantly better detection performance on the data sets tested using the Local Outlier Factor (LOF) anomaly detection algorithm, and we obtained on average a 96% detection rate and a 5% false alarm rate. These results indicate that sufficient information exists in the features selected by SBS-AMF to build a real IDS that is not susceptible to the characteristics of the attack behavior or to a specific workload.

Due to the growing popularity of Graphics Processing Units (GPUs) in general-purpose computing domains, we applied this parallel computing approach to accelerate the LOF method and thereby enhance the detection speed of the proposed VMM IDS, as near real-time performance is needed in order to detect any malicious activity before the system becomes fully compromised. With the GPU-enabled LOF CUDA implementation we achieved more than a 100X. (Abstract shortened by UMI.)
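
As background for the margin-based weighting discussed in the abstract, the following minimal sketch computes the standard normalized voting margin of a boosted ensemble, y_i * sum_t(alpha_t * h_t(x_i)) / sum_t(|alpha_t|), together with its mean over the training examples. This is the usual textbook margin, not the thesis's Margin Fraction (MF) or AUC Margin Fraction (AMF), whose exact definitions the abstract does not give. The function names and the toy weights, predictions, and labels are illustrative assumptions.

```python
def normalized_margins(alphas, predictions, labels):
    """Return the normalized voting margin of each training example.

    alphas:      weights of the T base hypotheses (assumed example values).
    predictions: T lists, each holding the +1/-1 outputs of one base
                 hypothesis on the n training examples.
    labels:      the n true labels, each +1 or -1.
    """
    total_weight = sum(abs(a) for a in alphas)
    if total_weight == 0:
        raise ValueError("at least one hypothesis weight must be non-zero")

    margins = []
    for i, y in enumerate(labels):
        # Weighted vote of the ensemble on example i.
        vote = sum(a * h[i] for a, h in zip(alphas, predictions))
        # Positive margin: correct classification; magnitude: confidence.
        margins.append(y * vote / total_weight)
    return margins


def mean_margin(margins):
    """Average margin over all training examples."""
    return sum(margins) / len(margins)


# Toy ensemble of three base hypotheses on four examples (made-up data).
alphas = [0.5, 0.25, 0.25]
predictions = [
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, 1, 1, -1],
]
labels = [1, 1, -1, -1]

margins = normalized_margins(alphas, predictions, labels)
print(margins)
print(mean_margin(margins))
```

Each margin lies in [-1, 1]. In this toy ensemble every example is classified correctly, but only the first receives a unanimous vote, so the remaining margins are smaller.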
Keywords/Search Tags: Feature, Selection, Margin, Boosting, Using, LOF, IDS, Algorithms