Research On Malware Classification And Clustering Based On Deep Learning

Posted on: 2019-03-11 | Degree: Master | Type: Thesis
Country: China | Candidate: X Meng | Full Text: PDF
GTID: 2428330566470946 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, nearly every aspect of daily life has become deeply integrated with it. The Internet has made life more convenient, but it has also left people more exposed to online threats, especially security threats in the form of malware attacks. On the one hand, the volume of malware has grown rapidly. On the other hand, as malware analysis and detection technologies have advanced, malware has evolved to evade detection, producing new variants that are rapidly copied and disseminated, which has led to a number of security incidents within a few years. Existing static analysis methods cannot effectively handle malware obfuscation, while dynamic analysis methods are inefficient and cover only a single execution path when dealing with large volumes of malware. Among machine-learning-based approaches, methods built on traditional machine learning algorithms, unlike those built on deep learning, cannot extract features automatically and depend on manually designed features. To address these shortcomings, this thesis draws on the theory and methods of deep learning and studies three topics: vectorization of malware genes based on the Word2Vec algorithm, vectorization of malware gene sequences based on the Doc2Vec algorithm, and malware classification and clustering based on neural networks. The main work is as follows:

1. Methods for malware gene extraction and for Word2Vec-based vectorization of malware genes are proposed. (1) In the gene extraction stage, working at the level of malware behavior, the method uses the marching recursive algorithm, built on an automated disassembly component, to extract malware gene sequences from massive numbers of malware samples. (2) In the gene vectorization stage, a Word2Vec-based prediction model of malware genes is constructed. The model maps each malware gene to a high-dimensional real-valued vector and uses the spatial distance between vectors to express the semantic relevance of malware genes. Semantic reasoning over the malware gene vectors and visual verification show that the Word2Vec-based vectorization model represents the semantic features of malware genes well.

2. A classification model based on the Convolutional Neural Network (CNN), SMM_CNN (Static Malware Matrix_Convolution Neural Network), is designed and implemented. The model handles two kinds of input. (1) For obfuscated malware, the malware binary code is converted into a matrix that serves as the model input. The neural network extracts features automatically from the data for classification, which mainly addresses the problem that features cannot be extracted effectively when the malware cannot be disassembled correctly. (2) For non-obfuscated malware, the malware gene sequences described in item 1 are converted into matrices. The convolutional and pooling layers automatically extract features from these matrices, which carry both semantic and sequence characteristics, and the malware is then classified. Experiments show that, for obfuscated malware, the classification accuracy of SMM_CNN is 8.4% higher than that of the traditional image texture matching model, so static resistance to obfuscation is achieved to a certain extent. For non-obfuscated malware, SMM_CNN performs better in malware classification, with accuracy of up to 98.04%.

3. An improved version of SMM_CNN, the SMGS_RCNN (Sequential Malware Gene Sequences-Recurrent Convolution Neural Network) model, is designed and implemented. SMM_CNN is relatively weak at capturing full-sequence information about malware genes. SMGS_RCNN integrates a Recurrent Neural Network (RNN) into SMM_CNN and uses the forgetting, preservation, and long-term memory mechanisms of LSTM (Long Short-Term Memory) components to extract sequence features from malware gene sequences. Experiments show that, compared with SMM_CNN, the classification accuracy of SMGS_RCNN improves by 0.83%.

4. A malware representation method that captures both the overall semantics of malware and the sequence information of malware genes is designed, implemented, and applied to malware clustering. In the vectorization phase, a predictive model is constructed for malware gene sequences. Each malware gene sequence is converted into a high-dimensional real-valued vector, and these vectors represent both the sequence information of the gene sequences and the overall semantic information of the malware. An unsupervised learning algorithm is then used to cluster malware families. Experiments show that, compared with a random coding representation, this method achieves better clustering accuracy on the clustering model.
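The abstract states that malware binary code is converted into a matrix before being fed to the CNN, but it does not say how. The following is a minimal sketch of one common way to do this, assuming a fixed row width, zero padding of the last row, and scaling of byte values to the range 0 to 1. The function names `bytes_to_matrix` and `normalize`, the `width` and `pad` parameters, and the sample bytes are all hypothetical and are not taken from the thesis.

```python
import math
from typing import List


def bytes_to_matrix(data: bytes, width: int = 16, pad: int = 0) -> List[List[int]]:
    """Reshape raw bytes into rows of `width` integers in 0..255.

    The last row is padded with `pad` so that every row has the same length.
    The fixed width and zero padding are assumptions of this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = math.ceil(len(data) / width)
    padded = data + bytes([pad]) * (rows * width - len(data))
    return [list(padded[i * width:(i + 1) * width]) for i in range(rows)]


def normalize(matrix: List[List[int]]) -> List[List[float]]:
    """Scale byte values to floats in [0, 1], a typical input range for a CNN."""
    return [[value / 255.0 for value in row] for row in matrix]


if __name__ == "__main__":
    # Hypothetical 7-byte sample; a real input would be the contents of a binary file.
    sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00])
    matrix = bytes_to_matrix(sample, width=4)
    print(matrix)             # [[77, 90, 144, 0], [3, 0, 0, 0]]
    print(normalize(matrix))  # same shape, values scaled to [0, 1]
```

Because this representation needs no disassembly, it can still be built for samples that cannot be disassembled correctly, which is the case item 2 (1) targets.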
Keywords/Search Tags: Deep Learning, Malware, Malware Gene, Classification, Clustering