Study On Dynamic Behavior Based Malware Analysis And Detection

Posted on:2015-09-10

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Cao

Full Text:PDF

GTID:1108330464968944

Subject:Computer application technology

Abstract/Summary:

Malware is one of the major threats facing computer systems. Code obfuscation and the prevalence of automatic malware generators on the Internet have led to an explosive growth of variant and unknown malware samples, which greatly challenges traditional static signature based malware detection methods. In response, behavior based malware detection has been proposed. It consists of three parts that all contribute to accurate detection: behavior data capturing, behavioral feature abstraction and behavior based malware detection algorithm design. This dissertation studies behavior based malware detection around these three parts. The main contributions are outlined as follows.

1. In the research of behavior data capturing, a malware behavior capturing system called Osiris is designed, which addresses the semantic gap between the virtual machine monitor and the guest operating system. Osiris uses the open source emulator QEMU as the core virtual execution component for the program under analysis. In Osiris, program behavior capturing is implemented at the virtual machine monitor layer, which is the most privileged layer in a virtual machine system, so malware can hardly escape analysis. In detail, first, an API call monitoring framework is inserted into the CPU emulation procedure of QEMU, so that API calls invoked by the analyzed program are monitored. Second, Osiris employs a bi-emulator architecture to emulate the whole Internet environment for malware. In addition, common host events are emulated to trigger hidden malicious behaviors of the analyzed program. Experimental evaluations demonstrate that Osiris is a new analysis tool capable of capturing program behaviors. It lays a solid foundation for the malware detection algorithms that follow.

2. In the research of behavior abstraction, minimal security sensitive behaviors are proposed to depict the behavioral features of the analyzed programs, and the corresponding abstraction algorithm is also designed.
The output parameter of one API call can be the input parameter of another API call, and this dependence is a stable behavioral feature of a program. Based on this observation, API calls that operate on the same sensitive system resource are grouped together by low level data dependence analysis. Meanwhile, API call parameters are also abstracted. This is the idea behind minimal security sensitive behaviors. The abstracted behavioral features are then embedded into a high dimensional vector space so that they can be processed by almost all prevalent machine learning algorithms. Experiments on program similarity comparison, clustering and classification all demonstrate that the proposed behavioral feature abstraction method can depict the discriminative behavioral characteristics that separate malware from benign programs.

3. Malware detection by combining features from both static and dynamic analysis is proposed. In addition, a feature selection algorithm called BoostFS, which is based on the idea of totally corrective Boosting, is proposed. BoostFS uses the Decision Stump as its base learner. Because a Decision Stump is a decision tree with only one node, it can also be regarded as a selected feature. In each iteration, BoostFS searches for a base learner whose classification result vector is as orthogonal as possible to the classification result vectors of all the already trained base learners.

4. Malware detection is typically a cost sensitive learning problem. Therefore, cost sensitive Boosting algorithms are designed by strictly following the Boosting framework and theory. First, the exponential loss function and the Logit loss function of the classification margin are revised into cost sensitive settings. Then, gradient descent in function space is used to optimize these cost sensitive loss functions. This leads to the proposed cost sensitive Boosting algorithms AsyB and AsyBL.
It can be proved that AsyB and AsyBL converge to the optimal cost sensitive Bayes decision rule under limiting conditions. In addition, the Neyman-Pearson decision rule is used to determine the cost factors in the malware detection problem.

5. It is acknowledged that AdaBoost is sensitive to noise data. To address this problem, a more noise resilient Boosting algorithm, RBoost, is proposed. RBoost strictly follows the Boosting framework and theory. First, it uses a non-convex loss function of the classification margin called the Savage2 loss function. Because the Savage2 loss function imposes limited penalties on misclassified samples with large margins, it is less sensitive to noise data, which are always misclassified during Boosting iterations. Second, RBoost uses an adaptive Newton step to compute the theoretically optimal base learners in each iteration, which is more numerically stable. Both steps contribute to the robustness of RBoost to noise data. Experimental evaluations demonstrate that when the training and testing sets contain noise data, RBoost consistently achieves higher malware detection accuracy.
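
The remark in contribution 3 that a Decision Stump can be regarded as a selected feature can be illustrated with a minimal Python sketch. This is not the BoostFS algorithm of the dissertation: it uses plain AdaBoost reweighting instead of totally corrective Boosting and does not apply the orthogonality criterion. The function names `best_stump` and `select_features` and the toy data are hypothetical, chosen only for illustration.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Predict +1 when polarity * (x[feature] - threshold) > 0, else -1."""
    return 1 if polarity * (row[feature] - threshold) > 0 else -1


def best_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error.

    Returns (error, feature, threshold, polarity).
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(
                    wi
                    for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, j, thr, pol) != yi
                )
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def select_features(X, y, rounds=3):
    """Run plain AdaBoost with stumps and return the feature indices
    used by the chosen stumps, in the order they were first picked.

    Assumption: labels are +1 / -1. This is a sketch, not BoostFS.
    """
    n = len(X)
    w = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        if err >= 0.5:
            break  # no stump is better than chance on the current weights
        if j not in chosen:
            chosen.append(j)  # each stump tests one feature: that feature is selected
        if err == 0:
            break  # a perfect stump leaves nothing to reweight
        alpha = 0.5 * math.log((1 - err) / err)
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return chosen


# Toy data (hypothetical): only feature 1 agrees with the label.
X_toy = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
y_toy = [-1, -1, 1, 1]
print(select_features(X_toy, y_toy))
```

Because every stump thresholds exactly one feature, the list of features used by the trained stumps is itself a feature subset, which is the sense in which a base learner doubles as a selected feature.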

Keywords/Search Tags:

Malware Detection, Boosting, Feature Selection, Cost Sensitive Learning, Anti-noise

Related items

1	Three Kinds Of Cost Under The Environment For Sensitive Attribute Selection
2	Research On Two Algorithms For Cost Sensitive Feature Selection
3	Cost Sensitive Feature Selection Based On Data Correlation
4	Research On Cost-sensitive Feature Selection Problem
5	Cost-Sensitive Feature Selection Algorithms With Application In Software Defect Prediction
6	Research For Techniques Of Anti-evasion Of JavaScript Malware Detection Systems
7	Cost Sensitive Learning Method On Heterogeneous Data
8	Research On Cost-Sensitive Feature Selection Algorithm Based On Semi-Greedy Strategy
9	Research On Android Malware Detection Method Based On Multi_feature
10	Cost-Sensitive Feature And Instance Selection For Imbalanced Network Abnormal Datasets