Research On Feature Extraction And Classification Of Malware Based On Machine Learning

Posted on: 2021-05-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Liu
Full Text: PDF
GTID: 1368330614472176
Subject: Computer Science and Technology
Abstract/Summary:
In today's complex network environment, malware that illegally occupies user terminals or network equipment and steals private data spreads quickly in various ways, posing a serious threat to network and information security. Over the past decades, malware detection has attracted the attention of both researchers and security vendors. To detect evolving malware more accurately, this dissertation proposes several new methods that address the problem of extracting features for classifying malware with machine learning techniques. Experiments show that these methods achieve better classification accuracy and recognition ability than the methods they are compared with. The main contributions are as follows:
(1) A multi-layer learning BoVW model is proposed to extract visual features of malware, which are then used to classify malware with machine learning techniques. The model introduces the concept of “bag of visual words” (BoVW): instead of analyzing the binary executable file directly, it analyzes the gray-scale image converted from it. Splitting the image into blocks, clustering them, and building the bag of visual words yields features that are more flexible than global features and more robust than local features. The BoVW model is evaluated on several malware databases with various classifiers and achieves state-of-the-art classification performance.
(2) A multi-feature fusion method is proposed to describe malware features and to increase the detection accuracy of malware variants. First, an improved LBP method is proposed. Then global features (GIST) and local features (LBP or dense SIFT) are combined to construct combined descriptors of malware gray-scale images; the combined descriptors are less prone to confusion. With these descriptors, classification performance improves greatly over traditional methods, especially for samples that are highly similar across different families or dissimilar within the same family. Experimental results show that the proposed method is much more effective than traditional methods, and that classification accuracy improves greatly on a more confusing dataset.
(3) A new method is proposed to measure the similarity of the simhash values of operation code (opcode) sequences of malware function blocks. It addresses the difficulty of classifying malicious code that arises because simhash is too sensitive. The method extracts the opcode sequences of function blocks by reverse analysis of the malware, computes texture features of the simhash gray-scale image, and classifies malware samples using these features. It overcomes the difficulty of judging the similarity of function-block simhash values by Hamming distance. Preliminary experimental results show that, compared with the traditional malware visualization method, the new method obtains more effective features with higher information density and achieves higher efficiency and classification accuracy. The method can also be regarded as a new approach to reducing the dimension of simhash values.
(4) An unsupervised malware detection approach based on a probabilistic topic model is studied. The dissertation proposes an unsupervised malware identification method using latent Dirichlet allocation (LDA). It extracts the latent “document-topic” and “topic-word” probability distributions from samples and uses them as sample features to build a new malware detection framework for training models and testing malware. The method also removes the need to specify the number of topics in the LDA model beforehand: using perplexity with different step sizes, it estimates the best number of topics quickly and automatically. Finally, it analyzes the semantics of the “document-topic” and “topic-word” aggregation results in terms of assembly instructions, which explains the latent semantics of the features. Experimental results show that the method is more discriminative and gives better classification results than other methods, while accurately discriminating novel malware variants.
(5) A dynamic feature description and classification method for malware based on a heterogeneous information network (HIN) is proposed. The “API” and “DLL” information of samples is acquired dynamically in a sandbox, and an HIN is constructed. Four meta-graph schemes over “File”, “API” and “DLL” are proposed. An improved random walk strategy obtains the context information of the object nodes in the meta-graph schemes, which serves as input to a CBOW model to obtain network embeddings as word vectors. The principal-angle method is improved with voting to obtain the classification result from the multiple meta-graph schemes. Compared with other methods, it greatly improves the classification accuracy of malware based on the features of each meta graph when limited information is available, and it is more general and reproducible.
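The image conversion and block splitting behind contribution (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names `bytes_to_gray_image` and `split_blocks`, the row width, the zero padding of the last row, and the non-overlapping block layout are all hypothetical choices. The clustering step that turns blocks into visual words is omitted.

```python
def bytes_to_gray_image(data: bytes, width: int = 16) -> list:
    """Map each byte to one gray pixel (0-255) and wrap rows at `width`."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad so the last row is complete (the padding value is an assumption).
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def split_blocks(image: list, size: int) -> list:
    """Cut the image into non-overlapping size x size blocks.

    Rows and columns that do not fill a whole block are dropped.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    rows = len(image)
    cols = len(image[0]) if image else 0
    blocks = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            blocks.append([row[c:c + size] for row in image[r:r + size]])
    return blocks
```

In a bag-of-visual-words pipeline, each block would typically be turned into a descriptor, the descriptors clustered, and every sample represented by a histogram of cluster assignments; those stages depend on design choices the abstract does not specify.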
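For contribution (3), the sketch below shows a textbook simhash over opcode n-grams, the Hamming distance that the abstract describes as hard to use for similarity judgments, and one possible way to render a fingerprint as gray pixels. All names (`opcode_ngrams`, `simhash`, `hamming_distance`, `fingerprint_to_pixels`), the use of SHA-256 as the token hash, the n-gram tokenization, and the bit-to-pixel mapping are assumptions made for illustration; the abstract does not give the dissertation's actual construction or its texture features.

```python
import hashlib


def opcode_ngrams(opcodes, n=3):
    """Slide a window of n opcodes over the opcode sequence of one function block."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def simhash(tokens, bits=64):
    """Textbook simhash: every token votes +1 or -1 on each fingerprint bit."""
    if bits % 8 or not 8 <= bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when more tokens voted for 1 than for 0.
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_to_pixels(fp, bits=64):
    """Render a fingerprint as a row of gray pixels (assumed mapping: 1 -> 255, 0 -> 0)."""
    return [255 if (fp >> i) & 1 else 0 for i in reversed(range(bits))]
```

Stacking one pixel row per function block would give a small gray-scale image per sample, from which texture features could then be computed; how the dissertation arranges the rows is not stated in the abstract.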
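The random walk in contribution (5) can be illustrated with a plain meta-path-guided walk over a toy network. This is not the improved strategy or the meta-graph schemes from the dissertation, which the abstract does not detail; the function `metapath_walk`, the toy graph in `NEIGHBORS` and `NODE_TYPE`, and the "File-API-File" path are hypothetical and serve only to show how typed walks produce node sequences.

```python
import random

# Toy heterogeneous network (hypothetical data): files linked to the APIs they call.
NODE_TYPE = {
    "f1": "File", "f2": "File", "f3": "File",
    "CreateFileW": "API", "WriteFile": "API",
}
NEIGHBORS = {
    "f1": ["CreateFileW", "WriteFile"],
    "f2": ["WriteFile"],
    "f3": ["CreateFileW"],
    "CreateFileW": ["f1", "f3"],
    "WriteFile": ["f1", "f2"],
}


def metapath_walk(neighbors, node_type, start, metapath, length, rng):
    """Return a walk of up to `length` nodes whose types follow `metapath` cyclically.

    `metapath` must begin and end with the same type, e.g. ["File", "API", "File"],
    so that it can be repeated.
    """
    if len(metapath) < 2 or metapath[0] != metapath[-1]:
        raise ValueError("metapath must start and end with the same node type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first metapath type")
    period = len(metapath) - 1
    walk = [start]
    for step in range(1, length):
        wanted = metapath[step % period]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same walks
    for start in ("f1", "f2", "f3"):
        print(metapath_walk(NEIGHBORS, NODE_TYPE, start, ["File", "API", "File"], 7, rng))
```

Walks of this kind play the role of sentences: feeding them to a CBOW-style embedding model gives each node a vector learned from the nodes that surround it in the walks.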
Keywords/Search Tags: malware, malware visualization, LDA, simhash, heterogeneous information network, machine learning