
Clustering Analysis Of Malicious Code Based On N-gram Feature Extraction

Posted on: 2021-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: H B Su
GTID: 2428330602979458
Subject: Computer technology
Abstract/Summary:
With the spread of networks and advances in computer technology, computer information security faces serious threats, and malicious code is one of the main means of attack. The growing volume and sophistication of malicious code disrupt people's lives, cause economic losses for individuals and enterprises, and even threaten national security. As detection and anti-detection techniques develop side by side, the rising volume of malicious code puts heavy pressure on analysts. This thesis uses static analysis to detect malicious code. First, an n-gram semantic feature extraction method is introduced to segment the semantics of malicious code, which are mapped to its opcode (operation code) features. In traditional n-gram feature extraction the length of the feature sequence is fixed, so some semantically rich feature sequences are lost. The thesis therefore proposes a malicious code feature extraction method based on mixed n-grams, which combines the multi-dimensional features of fixed-length short feature sequences based on information gain and selects the optimal data dimension to obtain a feature subset of the malicious code. To address the limitations of traditional feature selection algorithms, such as too many data dimensions, weakly representative features, and long running times, the thesis introduces a feature selection method based on the Pearson correlation coefficient and combines it with the K-means clustering algorithm to propose a malicious code clustering detection method with multi-fusion feature selection. Compared with the clustering results obtained with the DF feature selection method, the proposed method achieves a higher F1 value and purity and a lower entropy. Finally, the trained clustering detection model is used to design a malicious code detection system that identifies the categories of malicious code samples, which gives the work practical application value.
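The abstract names two ideas that are easy to illustrate in isolation: counting opcode n-grams of several lengths at once, and scoring a clustering by purity and entropy. The following is a minimal sketch under stated assumptions, not the thesis's implementation. The function names (`mixed_ngrams`, `purity`, `cluster_entropy`), the example opcodes, and the choice of n values are all hypothetical. The entropy shown is the common size-weighted class entropy per cluster, and the thesis may define it differently. Information-gain ranking, Pearson-based selection, and K-means itself are left out.

```python
from collections import Counter
from math import log2


def mixed_ngrams(opcodes, n_values=(2, 3, 4)):
    """Count opcode n-grams for several n at once (a 'mixed' n-gram bag).

    The n values are an assumption for illustration only.
    """
    counts = Counter()
    for n in n_values:
        for i in range(len(opcodes) - n + 1):
            counts[tuple(opcodes[i:i + n])] += 1
    return counts


def _class_counts_per_cluster(clusters, labels):
    """Group true labels by assigned cluster id."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    by_cluster = _class_counts_per_cluster(clusters, labels)
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def cluster_entropy(clusters, labels):
    """Size-weighted average of the class entropy inside each cluster."""
    by_cluster = _class_counts_per_cluster(clusters, labels)
    total = 0.0
    for cnt in by_cluster.values():
        size = sum(cnt.values())
        h = -sum((k / size) * log2(k / size) for k in cnt.values())
        total += (size / len(labels)) * h
    return total


if __name__ == "__main__":
    # Hypothetical opcode trace from a disassembled sample.
    trace = ["push", "mov", "call", "push", "mov"]
    print(mixed_ngrams(trace, n_values=(2, 3)).most_common(3))

    # Hypothetical cluster assignments and ground-truth family labels.
    assigned = [0, 0, 0, 1, 1, 1]
    truth = ["a", "a", "b", "b", "b", "b"]
    print(purity(assigned, truth), cluster_entropy(assigned, truth))
```

Higher purity and lower entropy both indicate clusters that line up more closely with the known malware families, which is the direction of improvement the abstract reports.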
Keywords/Search Tags: N-gram feature extraction, Multi-fusion feature selection, K-means clustering algorithm, Malicious code detection system