Research Of Malware Identification Based On Machine Learning

Posted on: 2021-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2518306050955119
Subject: Master of Engineering
Abstract/Summary:
With the development of the Internet's basic information infrastructure, more attack surfaces and techniques have emerged, resulting in a large number of security incidents. Among them, incidents based on malware have a greater impact. With the development of artificial intelligence technology, machine learning can be used to detect malware; compared with traditional detection technologies, this offers a new perspective for analyzing and detecting malware. This thesis designs and implements a toolbox for malware feature engineering. In addition, to meet the requirements of coarse-grained malware detection, a two-class (binary) detection model based on machine learning is proposed. To identify the type of malware more specifically, we propose a multi-class detection model based on deep learning.

The malware feature engineering toolbox implements the extraction, preprocessing, exploratory data analysis, characterization, and vectorization of malware dynamic behavior data. It consists of data extraction, data preprocessing, exploratory data analysis, and data characterization modules. By analyzing the behavioral characteristics of different malware, we encapsulate multiple characterization methods in the data characterization module, drawn from statistical learning and natural language processing: statistical feature methods, N-grams, and word sequence indexing. These methods characterize and vectorize malware behavior data and support upper-level machine learning and deep learning algorithms.

The machine learning-based binary detection model builds on the feature engineering toolbox. From a statistical perspective, it extracts the global and local statistical characteristics of malware dynamic behavior data. From the natural language processing perspective, it extracts the 2-gram and 3-gram features of the dynamic behavior data using N-grams and obtains the natural language processing features through SVD dimensionality reduction. We fuse the statistical features and the natural language processing features, and the LightGBM algorithm performs binary classification on the fused features. Finally, we designed multiple sets of comparative experiments to evaluate the performance of this binary detection model. The experimental results show that the binary detection model based on fused features and LightGBM outperforms methods such as decision trees and random forests, and that its binary classification accuracy reaches 97.9%.

The deep learning-based multi-class detection model uses the word sequence index method to extract the dynamic behavior characteristics of malware and combines it with our Malware CNN deep learning model to identify multiple types of malware. Compared with general deep learning models, Malware CNN has three improvements. First, from analyzing the dynamic behavior of malware, we found that convolutional layers and dilated convolutional layers in convolutional neural networks can, to a certain extent, build a large number of local features and effectively model the long-term behavior of malware. Second, an attention mechanism is added to perform long-distance semantic modeling and learn the dynamic behavior patterns of different malware. Third, the hyperparameters of network layers such as the embedding layer and the convolution layer are set based on analysis and exploration of the actual malware behavior data. The experimental results show that Malware CNN recognizes 4 different types of malware with an accuracy and recall of 89.6%. Compared with general deep learning models, the Malware CNN model we designed achieves the best performance.
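The characterization methods named in the abstract (statistical features, 2-gram and 3-gram features, and word sequence indexing) can be illustrated with a minimal Python sketch. It is not the thesis's toolbox: the function names, the two chosen statistics, the sample API call sequences, and the convention of reserving index 0 for padding and 1 for unknown calls are all assumptions made for illustration. The SVD reduction and the LightGBM and Malware CNN models are left out.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices for padding and unseen calls


def ngrams(seq, n):
    """Return every contiguous n-gram of a call sequence as a tuple."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_ngram_vocab(sequences, n_values=(2, 3)):
    """Collect all n-grams seen in the corpus, in a stable column order."""
    seen = set()
    for seq in sequences:
        for n in n_values:
            seen.update(ngrams(seq, n))
    # Sort by n-gram length first, then alphabetically.
    return sorted(seen, key=lambda g: (len(g), g))


def ngram_vector(seq, vocab, n_values=(2, 3)):
    """Count how often each vocabulary n-gram occurs in one sequence."""
    counts = Counter(g for n in n_values for g in ngrams(seq, n))
    return [counts[g] for g in vocab]


def statistical_features(seq):
    """Two simple global statistics: length and number of distinct calls."""
    return [len(seq), len(set(seq))]


def build_token_index(sequences):
    """Map each distinct call name to an integer id, starting at 2."""
    tokens = sorted({tok for seq in sequences for tok in seq})
    return {tok: i for i, tok in enumerate(tokens, start=2)}


def index_sequence(seq, token_index, max_len):
    """Word sequence indexing: truncate to max_len, then pad with PAD."""
    ids = [token_index.get(tok, UNK) for tok in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical behavior traces (illustrative API call names only).
samples = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "WriteFile", "CloseHandle"],
]
vocab = build_ngram_vocab(samples)
token_index = build_token_index(samples)

# Fused feature vector: statistical features followed by n-gram counts.
fused = [statistical_features(s) + ngram_vector(s, vocab) for s in samples]
# Fixed-length integer sequences, the usual input to an embedding layer.
indexed = [index_sequence(s, token_index, max_len=5) for s in samples]
```

In this sketch, `fused` corresponds to the kind of input a gradient-boosted classifier would take for binary detection, while `indexed` corresponds to the integer sequences a convolutional model would take for multi-class detection.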
Keywords/Search Tags: Windows Malware, Machine Learning, Deep Learning, Dilated Convolution, Self-Attention