Research On Statistical Feature Based Malware Classification Methods

Posted on: 2018-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Fang
GTID: 2428330623950739
Subject: Computer Science and Technology
Abstract/Summary:
Malware poses a serious threat to cyberspace, and machine-learning-based classification has become a major direction in malware research. Studying malware classification reveals the characteristics of malware families, and these characteristics help detectors identify malware variants, which makes classification research important. However, current classification studies run into problems with unbalanced malware datasets and with the selection and dimensionality reduction of malware features. On the one hand, some malware families contain a large number of samples while others have only a few variants, and the insufficient features available for the small families reduce classification accuracy. On the other hand, it is difficult to extract features and reduce their dimensionality in a way that both shortens training time and raises classification accuracy. Because features can be extracted from many aspects, feature redundancy is inevitable, and redundant features reduce both classification efficiency and classification accuracy.

In view of these problems, this thesis works with both dynamic and static features. Targeting small or unbalanced malware datasets, it studies and improves feature selection and feature dimensionality reduction algorithms. Features are extracted from different aspects, and ensemble learning algorithms are used to construct classification models. A feature selection method based on TF-IDF is proposed to select features with strong discriminative power: a weight is computed for each feature, the features with higher weights are selected, and the features are then weighted by those computed weights so that their discriminative power becomes more pronounced. The thesis also explores further issues in feature representation and feature selection. An improved information gain algorithm is designed for feature selection, and a feature processing method based on the transition probability of the function call graph is designed for graph features. By computing the sequence of function calls in a malware sample, the proposed algorithms can distinguish different function call sequences, and a bin histogram is used to build the malware feature space. Graph features and static features are then combined, and an ensemble learning algorithm is used to construct a malware classification model. The experimental results show that the approach classifies malware with a high F1-score and low classification time on different kinds of malware datasets.
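The TF-IDF-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formulas, so everything below is an assumption: samples are represented as hypothetical lists of tokens (for example API-call names), the weight is the textbook TF-IDF, features are ranked by their mean weight across samples, and the function names `tfidf_weights`, `select_features`, and `to_vectors` are invented for illustration.

```python
import math
from collections import Counter


def tfidf_weights(samples):
    """Return one {feature: tf-idf} dict per sample.

    samples: list of token lists (hypothetical input, e.g. API-call names).
    Uses the textbook form tf * log(N / df); the thesis may use a variant.
    """
    n = len(samples)
    # Document frequency: number of samples that contain each feature.
    df = Counter()
    for tokens in samples:
        df.update(set(tokens))
    weights = []
    for tokens in samples:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {f: (c / total) * math.log(n / df[f]) for f, c in counts.items()}
        )
    return weights


def select_features(weights, k):
    """Pick the k features with the highest mean TF-IDF (assumed scoring rule)."""
    score = Counter()
    for w in weights:
        for f, v in w.items():
            score[f] += v
    n = len(weights)
    # Sort by descending mean weight; break ties by name for determinism.
    ranked = sorted(score, key=lambda f: (-score[f] / n, f))
    return ranked[:k]


def to_vectors(weights, selected):
    """Represent each sample by its TF-IDF weights on the selected features."""
    return [[w.get(f, 0.0) for f in selected] for w in weights]


# Toy data: three samples, each a sequence of (illustrative) API-call tokens.
samples = [
    ["CreateFile", "WriteFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "RegSetValue", "RegSetValue", "CloseHandle"],
    ["CreateFile", "Connect", "Send", "CloseHandle"],
]
weights = tfidf_weights(samples)
selected = select_features(weights, k=2)
vectors = to_vectors(weights, selected)
```

In this toy data, tokens that occur in every sample receive a weight of zero and are never selected, while tokens concentrated in a single sample rank highest. The resulting vectors could then be passed to any ensemble classifier; which classifier the thesis uses is not stated in the abstract.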
Keywords/Search Tags: Malicious Code, Feature Selection, TF-IDF, Static Feature, Dynamic Feature