Research On Malware Detection And Classification Based On Machine Learning

Posted on: 2018-01-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 1368330623950363
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, information security has become an important means of maintaining social stability and economic development. In recent years, large-scale network attacks have occurred continually, seriously damaging personal privacy and economic interests. At the same time, the emergence of advanced attack modes, represented by APT, poses great potential threats to social infrastructure, public service departments, and military science and technology organizations. An analysis of existing attack methods shows that malware variants and "zero-day" attacks have become one of the most urgent problems facing information security. In particular, massive volumes of network information, automated techniques for generating malicious code variants, and malware protection technologies pose an unprecedented challenge to the detection and recognition ability of traditional anti-virus systems based on manual analysis. Therefore, malware detection and analysis technologies based on automated methods have become one of the hotspots in the field of information security. This dissertation studies four aspects of the technology: automatic feature extraction, malware detection, malware family classification, and unknown malware detection. The main research results are as follows:

1. Because any single feature extraction technique analyzes malware samples from only one perspective, this dissertation proposes three feature extraction methods based on static and dynamic analysis: texture features based on Gabor filters, opcode features based on the control flow graph, and API features based on dynamic behavior trajectories. The Gabor-filter method approaches the problem from the perspective of image texture structure. It first converts the binary malicious code into a grayscale image, and then uses Gabor filters to extract texture features at different frequencies and orientations. The main purpose of the opcode feature extraction method is to capture the call relationships among functions and code blocks in malware. It first converts binary files into assembly files with a disassembler, and then uses the n-gram method to extract opcodes from the assembly files in control-flow order. API feature extraction is a dynamic analysis method. We first capture the behavior trajectory of a sample with a virtual honeypot system, and then use a frequency-corrected information gain method to extract the API information in the behavior trajectory. Finally, by comparing the different feature extraction methods, we find that combined features built from texture features and opcode features are more effective for malware analysis.

2. The learning models used for automatic malicious code detection are divided into shallow learning models and deep learning models. Compared with shallow learning models, deep learning models are better at expressing complex functions, which makes them more suitable for mining the distribution of high-dimensional, complex feature spaces. Therefore, we propose a deep convolutional neural network model. Although this model has been widely used in image processing, it suffers from over-fitting like other deep models. Over-fitting means that the generalization ability of a hypothesis is weakened because the hypothesis is fitted too closely to the training data. To alleviate over-fitting, this dissertation proposes an optimized deep convolutional neural network model. A Dropout layer is added between the layers of the model, and the output of each filter layer is normalized with batch normalization, while the normalized result does not change the original distribution of the data. The experimental results show that the optimized model not only alleviates over-fitting, but also achieves higher performance than shallow learning models.

3. To solve the problem of automatic classification of malicious code, we propose a malware classification technique based on a negative-correlation selective ensemble learning model. The technique mainly addresses two problems. The first is that a single multi-class model has limited classification ability, tends to under-fit, and generalizes weakly. The second is the intrinsic conflict of ensemble learning, namely the contradiction between the diversity and the accuracy of the ensemble. The selective ensemble model proposed in this dissertation therefore uses the negative correlation principle to train the sub-models in the training phase. In the selection phase, the K-means algorithm is used to select the sub-models with greater diversity. The algorithm not only solves the problems of a single classification model, but also effectively reconciles the contradiction between the diversity and accuracy of the ensemble. Experiments in the fourth chapter show that our method not only has higher classification accuracy than classical machine learning models, but also performs well on every malware family.

4. The main methods for detecting unknown or "zero-day" malware are clustering algorithms. To address the inefficiency of clustering algorithms in high-dimensional sample spaces, we design a clustering algorithm based on shared nearest neighbors. Unlike classical clustering algorithms based on Euclidean distance and density, the method measures the similarity of samples by their shared nearest neighbors: if two samples are similar, they must have K nearest-neighbor samples in common. The method also distinguishes edge points from outlier points by computing the core points and the attraction of samples. The experimental results show that the clustering algorithm can not only detect unknown malware accurately, but also detect new malware families.
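
The n-gram step of the opcode feature extraction in item 1 can be illustrated with a minimal sketch. It assumes the opcodes have already been disassembled and arranged in a single sequence; the function name `opcode_ngrams` and the sample opcode list are hypothetical, and the sketch does not reproduce the control-flow traversal used in the dissertation.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Count every run of n consecutive opcodes in the sequence.

    `opcodes` is assumed to be a list of mnemonic strings that a
    disassembler has already produced (an assumption of this sketch).
    """
    # Zipping n shifted copies of the list yields all sliding windows of size n.
    windows = zip(*(opcodes[i:] for i in range(n)))
    return Counter(windows)


# Hypothetical opcode sequence, used only for illustration.
sample = ["push", "mov", "call", "mov", "call", "ret"]
bigrams = opcode_ngrams(sample, 2)
```

The resulting counts can be laid out as a fixed-length feature vector by agreeing on an ordering of the n-grams observed across the training set.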
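
The shared-nearest-neighbor similarity described in item 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's algorithm: the function names, the use of Euclidean distance to rank neighbors, and the toy points are all hypothetical, and the core-point and attraction computations are omitted.

```python
from math import dist


def k_nearest(points, i, k):
    """Return the indices of the k points closest to points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    # Assumption: neighbors are ranked by Euclidean distance.
    others.sort(key=lambda j: dist(points[i], points[j]))
    return set(others[:k])


def snn_similarity(points, a, b, k):
    """Number of samples that appear in both k-nearest-neighbor lists."""
    return len(k_nearest(points, a, k) & k_nearest(points, b, k))


# Two well-separated groups of hypothetical 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
same_group = snn_similarity(points, 0, 3, k=2)   # both have neighbors {1, 2}
cross_group = snn_similarity(points, 0, 4, k=2)  # no neighbors in common
```

Because the score depends on neighbor rankings rather than on raw distances, it is less sensitive to the way distances concentrate in high-dimensional spaces, which is the motivation given in item 4.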
Keywords/Search Tags: Malware detection, Machine learning, Convolutional neural network, Malware classification, zero day, Control flow chart