
Research On Malicious Code Detection Technology Based On Word Embedding And API Call

Posted on: 2024-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: W G Xin
GTID: 2568307058452634
Subject: Engineering
Abstract/Summary:
Malicious code is a computer program used for attacking, destroying, stealing information, or other illegal purposes, and it endangers the safety of users' information and property. With the spread and development of computer and Internet technology, malicious code has become one of the main threats in network security, and how to detect it effectively has become a research focus in the field.

In dynamic analysis, the main approach is to obtain the API call sequence of the software and analyze the characteristics of the API calls to detect malicious code. At present, the common method for extracting API call features is the N-gram method. However, N-grams reflect only the local information of API calls, ignore their semantic characteristics and temporal relationships, and cannot adapt to constantly changing malicious code behavior.

To address these problems, this paper uses time series analysis to study a malicious code detection method that can represent semantics and accurately extract API call features. First, to encode the semantic features of API call functions, an API call encoding method based on word embedding from natural language processing is studied. Three methods, Word2vec, GloVe, and FastText, are used to embed the API call sequence and extract its semantic features. Then, on the basis of this word embedding encoding, two time series analysis methods are applied: a Shapelet-based time series analysis method and the TextCNN method from natural language processing text classification. Both detect malicious code by extracting the temporal features of the API call sequence.

The main work of this paper includes:

(1) Word embedding techniques from natural language processing are used to encode dynamic API calls. This solves two problems: traditional encoding methods cannot express the semantic information of an API call, and an API call sequence in text form cannot be used for distance calculation. This paper uses word embedding to convert API calls into word vectors. The experimental results show that the encoding vectors of functions with similar operations cluster together; that is, the word vectors of API call functions with similar semantics are close to each other in the multi-dimensional space, which indicates that word embedding can effectively mine and represent semantic information.

(2) For the first time, word embedding is combined with the Shapelet-Transform technique in a dynamic analysis method. Extracting Shapelets allows the time series features of malicious code to be extracted precisely and automatically, and a machine learning classification model is then applied to detect malicious code. To address the high computational complexity of the Shapelet-Transform method, a GPU-optimized algorithm is used for acceleration. The experimental results show that applying time series analysis after the initial semantic information has been captured by word embedding further improves detection accuracy. The resulting analysis provides a decision-making basis for subsequent tracking, localization, and in-depth analysis of malicious code.

(3) A malicious code detection method is proposed that combines FastText word embedding with the TextCNN text classification model. The API call sequence of a program can be regarded as a text sequence, so it can be processed with natural language processing techniques. On top of the FastText word embedding encoding, a convolutional neural network extracts abstract features of the software code, and the TextCNN network structure is used to detect malicious code. This method solves the problem that the Shapelet-Transform method cannot handle long sequences and large-scale data sets. Experimental results show that this method trains faster and achieves higher accuracy.

(4) A malicious code detection system based on word embedding and API calls is designed and implemented, and its functional modules are introduced and demonstrated in detail.
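
As a rough illustration of how an embedded API call sequence can be turned into Shapelet-based features, the following minimal Python sketch computes the distance between a Shapelet and a sequence as the smallest distance over all sliding windows. It is not the thesis implementation: the two-dimensional vectors in `TOY_EMBEDDINGS`, the chosen API names, and the helper names `embed`, `shapelet_distance`, and `shapelet_transform` are all assumptions made for this example. In practice the vectors would come from a trained Word2vec, GloVe, or FastText model, and the Shapelets would be selected from training data.

```python
import math

# Hypothetical 2-D embeddings for a few API names (illustrative values only).
# Real vectors would be learned by Word2vec, GloVe, or FastText.
TOY_EMBEDDINGS = {
    "CreateFileW": (0.9, 0.1),
    "WriteFile": (0.8, 0.2),
    "CloseHandle": (0.5, 0.5),
    "RegOpenKeyExW": (0.1, 0.9),
    "RegSetValueExW": (0.2, 0.8),
}


def embed(api_calls):
    """Map a list of API names to a list of embedding vectors."""
    return [TOY_EMBEDDINGS[name] for name in api_calls]


def window_distance(window, shapelet):
    """Euclidean distance between two equal-length sequences of vectors."""
    total = 0.0
    for u, v in zip(window, shapelet):
        for a, b in zip(u, v):
            total += (a - b) ** 2
    return math.sqrt(total)


def shapelet_distance(series, shapelet):
    """Smallest distance between the shapelet and any window of the series."""
    m = len(shapelet)
    if len(series) < m:
        raise ValueError("series is shorter than the shapelet")
    return min(
        window_distance(series[i:i + m], shapelet)
        for i in range(len(series) - m + 1)
    )


def shapelet_transform(series, shapelets):
    """Turn one embedded sequence into a feature vector, one value per shapelet."""
    return [shapelet_distance(series, s) for s in shapelets]


if __name__ == "__main__":
    trace = embed(["CreateFileW", "WriteFile", "CloseHandle"])
    shapelets = [
        embed(["CreateFileW", "WriteFile"]),         # file-writing pattern
        embed(["RegOpenKeyExW", "RegSetValueExW"]),  # registry-writing pattern
    ]
    # A small first feature means the file-writing pattern occurs in the trace.
    print(shapelet_transform(trace, shapelets))
```

The resulting feature vector can be passed to any standard classifier. Because each window must be compared against each candidate Shapelet, the cost grows quickly with sequence length, which is consistent with the abstract's motivation for GPU acceleration and for the TextCNN alternative on long sequences.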
Keywords/Search Tags: Malicious code detection, API call sequence, word embedding, Shapelet, convolutional neural network