The Study Of Malicious Code Detection Based On Data Mining And Machine Learning

Posted on: 2010-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: X K Zhang
GTID: 2178360302459611
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Detecting malicious code has become difficult because of the growing number and variety of malicious programs and the popularity of code obfuscation techniques. Traditional signature-based detection is used in all anti-virus software. However, it requires the signature of a malicious program before that kind of program can be detected effectively, and the signature is usually acquired only after computers have been infected. This weakness leaves computer systems open to attack by malicious code. Recently, data mining and machine learning techniques have been applied to malicious code detection and have become a focus of research: data mining can uncover meaningful patterns in large volumes of code data, and machine learning can help summarize the knowledge needed to identify known malicious code. Similarity to known samples is then used to help detect unknown malicious code. This thesis uses data mining and machine learning techniques to detect malicious code. After introducing the relevant background and theory of malicious code, data mining, and machine learning, it discusses feature extraction and feature selection methods. The contributions of this thesis are as follows:
1. A malicious code detection system is implemented that uses variable-length N-grams over binary sequences for feature extraction and weighted information gain for feature selection. Several classifiers, including Decision Tree, SVM, and Naïve Bayes, are used to detect malicious code in the system.
2. The thesis uses a feature extraction method called variable-length N-gram, which makes up for the inability of the fixed-length N-gram to extract features of different lengths. A comparison of our experiments with Kolter's, which use the fixed-length N-gram method, shows that the variable-length N-gram performs better.
3. The thesis proposes a feature selection method based on weighted information gain. By combining information gain with classwise frequency, this method selects effective features more accurately. It makes up for a deficiency of information gain (IG), which takes into account whether a feature is present but ignores how often it occurs. The experimental results show that this method effectively improves the detection rate and accuracy.
The research and experiments described above demonstrate the efficiency and accuracy of the approach.
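The pipeline the abstract describes (extract byte N-grams, then rank them by information gain) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy byte strings, and the choice to approximate "variable length" by simply collecting N-grams for a range of lengths. The abstract does not give the thesis's actual variable-length extraction procedure or the weighting formula, so the sketch uses plain, unweighted information gain on feature presence, which is the baseline the thesis sets out to improve.

```python
import math


def byte_ngrams(data, n_min=2, n_max=4):
    """Distinct byte N-grams of every length from n_min to n_max.

    A naive stand-in for variable-length extraction (an assumption,
    not the thesis's method).
    """
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            grams.add(data[i:i + n])
    return grams


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def class_counts(labels):
    """Counts of benign (0) and malicious (1) labels."""
    return [labels.count(0), labels.count(1)]


def information_gain(samples, labels, gram):
    """Unweighted IG of the binary feature 'gram occurs in the sample'.

    samples: list of N-gram sets, one per file; labels: 0 or 1 per file.
    Only presence is considered, never frequency.
    """
    present = [lab for s, lab in zip(samples, labels) if gram in s]
    absent = [lab for s, lab in zip(samples, labels) if gram not in s]
    total = len(labels)
    conditional = sum(
        (len(part) / total) * entropy(class_counts(part))
        for part in (present, absent)
    )
    return entropy(class_counts(labels)) - conditional


def top_features(samples, labels, k):
    """The k N-grams with the highest information gain."""
    vocabulary = set().union(*samples)
    ranked = sorted(
        vocabulary,
        key=lambda g: information_gain(samples, labels, g),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): two "malicious" and two "benign" byte strings.
raw = [b"\x90\x90\xde\xad", b"\xde\xad\x00\x01",
       b"\x00\x01\x02\x03", b"\x90\x90\x02\x03"]
labels = [1, 1, 0, 0]
samples = [byte_ngrams(r) for r in raw]
```

In this toy set, an N-gram that occurs in exactly the malicious samples gets the maximum gain, while one that is spread evenly over both classes gets none. The selected N-grams would then serve as features for a classifier such as a decision tree, SVM, or Naïve Bayes.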
Keywords/Search Tags: Malicious Code Detection, Data Mining and Machine Learning, Variable Length N-gram, Weighted Information Gain