Case Studies For Semantic Aware Statistical Machine Learning Applications In Code Security Problems

Posted on:2011-03-31

Degree:Doctor

Type:Dissertation

Country:China

Candidate:D G Kong

Full Text:PDF

GTID:1118360305966584

Subject:Control theory and control engineering

Abstract/Summary:

In the past two years, the economic loss caused by viruses, worms, and other kinds of network attacks has reached approximately 8.5 billion dollars; in the mainland, billions of host computers, networks, and websites have been attacked, compromised, or tampered with. On one hand, malware of many kinds and functions (e.g., viruses, worms, rootkits, spyware, botnets, adware) keeps appearing, and attack methods and procedures are becoming more complicated and cunning; on the other hand, the vulnerabilities in information systems are increasing quickly, and more and more bugs and errors are found in software-based information systems. Because attackers are driven by the pursuit of economic gain, malware detection and defense remains the number-one problem. Attackers upgrade and complicate their attack approaches without pause, and defenders propose new defense techniques in response to newly observed attacks (e.g., patching the software, injecting a vaccine into the information system); attackers then introduce novel anti-detection and evasion techniques, which forces defenders to update their existing defenses. Attackers and defenders thus play a game against each other in a dynamically evolving process, so the saddle point of the game shifts a little at every step.

Statistical machine learning (SML for short) originates from statistics and is good at relationship inference and automatic knowledge discovery; it has produced very good results in text analysis, video analysis, image understanding, and voice signal recognition. If we view malware generation and detection as a game between cat and mouse, can SML play the role of the "cat" well? Can SML make state-of-the-art malware analysis and detection more powerful?
The difficulty of this problem is that the basic requirements of secure information systems are not well aligned with the requirements of machine learning applications. For example, the information security field has very little tolerance for false positives and false negatives; the results obtained from machine learning lack explanations, so there is a large gap between model results and the actual system configuration; and machine learning algorithms have to take into account the different kinds of attacks and evasion techniques adopted by attackers. Nearly all problems in security-oriented machine learning are a game between attackers and defenders, and considering them from both the attackers' and the defenders' perspectives helps in solving them.

For the specific application field of code analysis, and drawing on domain knowledge, we pose the following research problems. A) Can machine learning techniques be used in malware analysis or code analysis? B) How much effect will they have on malware analysis (which can also be easily extended to code analysis or system security), and how can we get the most out of them? We refine these abstract problems into several sub-problems and answer them with concrete cases.
Q1: How can a polymorphic worm fingerprint be extracted?
Q2: How can attribution analysis be performed on polymorphic shellcode?
Q3: How can obfuscated malware be detected?
Q4: In multi-threaded programs, how can nondeterministic bugs related to time sequences be eliminated?
To be exact, the malware we focus on in this work is of two types: the first is network-packet-based malware (e.g., polymorphic worms and shellcode propagated over the network), and the second is file-based malware (e.g., compromised executables and dynamic link library files).
In addition, we analyze a case related to the security of multi-threaded programs.

To address the problems posed above, we carry out research in the following directions: extracting polymorphic worm signatures by integrating semantics with statistical features; performing attribution analysis on polymorphic shellcode by integrating semantics with statistical features; detecting obfuscated malware by integrating semantics with statistical features; and inferring the influence of time sequences on nondeterministic bugs in multi-threaded programs.

In this work, we make the following contributions.
A) We present a novel code analysis method that integrates semantic features with statistical features and can be used to detect and classify malicious packets and disk files. It retains the advantages of statistical analysis, which helps with accurate description, while staying close to the essence of code semantics and capturing its basic characteristics.
B) We propose a dataflow-based state-transition-graph signature for generating signatures from worm packets. Dataflow analysis filters the noise out of the network packets, so the state-transition-graph signature captures both the semantic and the statistical characteristics of the packets.
C) We propose a shellcode attribution analysis algorithm that integrates static taint analysis with a mixture of Markov models. The static taint analysis preserves the semantically relevant bytes in the packets, and the mixture of Markov models captures the statistical structure of the packets; the combination is more robust than purely statistical approaches and quantitatively more powerful than approaches based only on taint analysis.
D) We present an obfuscated malware detection method that integrates control flow information with system-call features. The control flow and system call information capture the semantic features of the code, and the statistical features complement those semantic features to give a much more accurate classification of obfuscated malware.
E) We present a Hidden Markov Model that describes the influence of context on the execution of multi-threaded programs. The context (including priority, system load, running time, etc.) captures the influence of the environment, and we further quantify those influences, which helps in analyzing nondeterministic bugs.

From the above research, we find that statistical machine learning coupled with (or reflecting) certain kinds of semantics can be used effectively in malware defense. The techniques for semantic-aware polymorphic shellcode signature generation and attribution analysis have appeared in a software product of DayZeroSystems, and the boosting approach that integrates multi-level features to detect obfuscated malware has been applied in next-generation intelligent security products (e.g., Damballa Inc). We are optimistic that these techniques will find more applications in future intelligent security detection problems.
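
The statistical half of the shellcode attribution idea in contribution C can be illustrated with a minimal sketch. This is not the dissertation's algorithm: it is a simplified, supervised stand-in that fits one first-order Markov chain per known family over raw bytes and attributes a sample to the family with the highest log-likelihood, rather than learning a true mixture. The static taint analysis step is omitted, and the function names, the family labels, and the toy byte strings are all assumptions made up for illustration.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 256  # possible byte values 0..255


def fit_markov_chain(samples, alpha=1.0):
    """Fit a first-order Markov chain over bytes with additive smoothing.

    Returns a function log_prob(prev, nxt) giving log P(nxt | prev).
    `alpha` is the smoothing constant (an assumed default, not from the source).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        # Iterating over a bytes object yields integers 0..255.
        for prev, nxt in zip(sample, sample[1:]):
            counts[prev][nxt] += 1

    def log_prob(prev, nxt):
        row = counts.get(prev, {})
        total = sum(row.values())
        return math.log((row.get(nxt, 0) + alpha) / (total + alpha * ALPHABET_SIZE))

    return log_prob


def log_likelihood(log_prob, sample):
    """Sum of transition log-probabilities along the byte sequence."""
    return sum(log_prob(p, n) for p, n in zip(sample, sample[1:]))


def attribute(models, sample):
    """Return the family whose chain gives the sample the highest log-likelihood."""
    return max(models, key=lambda family: log_likelihood(models[family], sample))


# Hypothetical toy training data: two made-up "families" of byte strings.
training = {
    "family_a": [b"\x90\x90\x90\x90\x31\xc0", b"\x90\x90\x90\x31\xc0\x50"],
    "family_b": [b"\x41\x42\x41\x42\x41\x42", b"\x42\x41\x42\x41\x42\x41"],
}
models = {family: fit_markov_chain(samples) for family, samples in training.items()}

print(attribute(models, b"\x90\x90\x31\xc0"))
print(attribute(models, b"\x41\x42\x41"))
```

In the approach described above, the byte sequences fed to such a model would first be reduced by taint analysis to the semantically relevant bytes; here the raw bytes are used directly to keep the sketch short.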

Keywords/Search Tags:

attack, defense, machine learning, data flow analysis, semantic, hidden markov model, conditional random field, ensemble learning

Related items

1	Study On Text Emotion Analysis Based On Supervised Learning
2	The Research And Application Of Markov Random Field In Machine Learning
3	Highlight Goal Event Detection In Soccer Videos
4	Research On Intrusion Detection And Network Attack Behavior Prediction
5	An Action Recognition Method Using Improved Hidden Conditional Random Field Model
6	Research On Parameter Injection Attack Detection Method Based On Machine Learning
7	Research On Stochastic Attack Oriented Industrial Control System Attacks Modeling And Detection
8	Research On Road Scene Segmentation Model Based On FCN And Conditional Random Field
9	Applications Of Machine Learning Approaches To Biological Sequence Analysis
10	Research Of HMM Stock Index Trading Model Based On Ensemble Learning