Malware Detection Techniques Based On Data Mining And Machine Learning

Posted on: 2014-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: B H Feng
Full Text: PDF
GTID: 2298330434450875
Subject: Computer Science and Technology
Abstract/Summary:
Malware detection techniques based on data mining (DM) and machine learning (ML) offer automation, intelligence, and high detection accuracy on previously unseen malware, and they are a research hotspot in the field of malware detection. To address two weaknesses of current DM- and ML-based detection techniques, namely reliance on a single feature description method and the weak generalization capability of classifiers, a malware detection technique based on multiple features and selective ensemble learning is proposed. The main research contents and innovations are as follows:

Firstly, the definition and classification of malware are summarized, along with the advantages and disadvantages of a variety of detection methods. The thesis focuses on malware detection techniques based on DM and ML, and analyzes and describes their detection framework and principles in detail.

Secondly, multidimensional static features are used to describe malware. 19-dimensional static structure features are extracted from the file structure layer, and n-gram sequence features of different lengths are extracted from the byte layer, the opcode layer, and the semantic layer respectively; combined, these form the initial feature set. To control the scale of the sequence features, three methods are adopted: ① limiting the search scope of the byte sequence features; ② considering only sequences composed of common opcodes and key APIs; ③ reducing the sequence features by combining information gain (IG) with rough set theory.
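As an illustration of the byte-layer step, the following is a minimal sketch of extracting byte n-grams within a limited search scope and ranking them by information gain. It is not the thesis's implementation: the function names, the `limit` parameter, and the toy samples are assumptions made for this example, and the rough set reduction step is not shown.

```python
import math
from collections import Counter
from typing import Optional


def byte_ngrams(data: bytes, n: int, limit: Optional[int] = None):
    """Return the set of distinct byte n-grams found in the first `limit` bytes.

    `limit` is a hypothetical stand-in for "limiting the search scope".
    """
    scope = data if limit is None else data[:limit]
    return {scope[i:i + n] for i in range(len(scope) - n + 1)}


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(present, labels) -> float:
    """IG of a binary feature (n-gram present or absent) with respect to the labels."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for p, y in zip(present, labels) if p == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain


# Toy data, invented for illustration: (file bytes, label), 1 = malware, 0 = benign.
samples = [
    (b"\x4d\x5a\x90\x00\xe8\x00", 1),
    (b"\x4d\x5a\x90\x00\xe8\x01", 1),
    (b"\x4d\x5a\x90\x00\x55\x8b", 0),
    (b"\x4d\x5a\x90\x00\x55\x8c", 0),
]
labels = [y for _, y in samples]
grams = [byte_ngrams(data, 2) for data, _ in samples]
vocab = set().union(*grams)

# Score every 2-gram by how well its presence separates the two classes.
scores = {g: information_gain([g in s for s in grams], labels) for g in vocab}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy set, a 2-gram shared by every file carries no information about the class, while one that appears only in the malware samples receives the maximum score, which is why IG ranking can shrink a large n-gram vocabulary before further reduction.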
Different classification algorithms are then used to evaluate the sequence features in the initial feature set after dimensionality reduction, and the features that contribute most to classification are selected to obtain the optimal feature subset.

Finally, the resulting feature subset is used to train 15 different types of base classifiers, separately for each feature type. For each feature type, the optimal subset of base classifiers is selected based on accuracy, AUC, and a difference degree calculated from the misclassified samples, and the selected classifiers are combined by relative majority voting. Weighted majority voting is then used for decision fusion of the selective ensemble results for each feature type, yielding the final classification.

Experimental results show that the proposed method is effective: it achieves better detection accuracy and generalization capability on the experimental data sets and has practical application value. 26 figures, 19 tables, and 69 references.
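The two-stage voting described above can be sketched as follows. This is an illustrative example under stated assumptions, not the thesis's code: the feature-type names, the base classifier predictions, and the weights (imagined here as coming from validation performance) are all invented, and the selection of base classifiers by accuracy, AUC, and difference degree is not shown.

```python
from collections import Counter, defaultdict


def plurality_vote(predictions):
    """Relative majority voting: return the label with the most votes.

    Ties go to the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(decisions, weights):
    """Weighted majority voting: sum the weight behind each label and
    return the label with the largest total."""
    totals = defaultdict(float)
    for name, label in decisions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)


# Hypothetical predictions of the selected base classifiers for one sample,
# grouped by feature type.
base_predictions = {
    "structure": ["malware", "benign", "malware"],
    "byte_ngram": ["benign", "benign", "malware"],
    "opcode_ngram": ["malware", "malware", "malware"],
}

# Hypothetical per-feature-type weights (an assumption for this sketch).
weights = {"structure": 0.5, "byte_ngram": 0.9, "opcode_ngram": 0.75}

# Stage 1: relative majority voting inside each feature type.
per_feature = {name: plurality_vote(preds) for name, preds in base_predictions.items()}

# Stage 2: weighted majority voting across feature types.
final_label = weighted_vote(per_feature, weights)
```

Here the byte n-gram ensemble has the largest single weight but is outvoted, because the combined weight of the two ensembles that agree with each other is larger.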
Keywords/Search Tags: data mining, multi-features, rough set attribute reduction, selective ensemble learning, malware detection