| Nowadays, the Internet has become an indispensable part of people's daily lives. While it brings great convenience, Internet security has become a sword of Damocles that could cause serious harm to society at any time. Because Windows is still the most widely used desktop operating system worldwide, PE virus files have the widest reach of harm. Moreover, the number of new PE viruses is still increasing every year, which overwhelms security vendors. Automatically clustering PE viruses by family therefore has important practical significance and supports their further analysis. Observing that existing methods for extracting static features from PE viruses do not consider the timing characteristics of n-grams, this paper first studies the theory of Word2vec and then proposes an algorithm that extracts n-gram timing features from PE viruses. This paper also studies the PE file format and the principles of clustering algorithms, designs and implements a clustering system for PE virus files, and validates the system using the proposed algorithm. The main contents and results of this paper are as follows: (1) Analyze static feature extraction for PE viruses, which currently ignores timing features, and propose an extraction algorithm for n-gram timing features. Current research on PE static feature extraction focuses on the information gain of n-grams, API calls, string information, and so on, and ignores the timing characteristics of n-grams. We therefore first analyze the structure of the PE file format in detail and then present a timing feature extraction algorithm. (2) Design and implement the n-gram timing feature algorithm. This paper uses Word2vec to convert the n-grams of a PE file into word vectors. To reduce the dimensionality of the resulting features, the K-means algorithm clusters n-grams with similar contextual semantics into one class, using the word vectors as the measure of similarity between words. (3) Design and implement a PE virus clustering system. The system consists of two parts. The first part uses the SGD algorithm to verify the effectiveness of the timing feature; the second part applies the timing feature to clustering PE viruses and compares the results of the K-means algorithm and the density peak algorithm. (4) Evaluate the PE virus clustering system. A set of virus samples is used to test the system. The results show that the proposed system achieves its intended goals and that the timing feature extraction algorithm is of practical use. |
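
Steps (1) and (2) above can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes the n-gram vectors have already been trained (for example with Word2vec) and uses hand-written toy vectors in their place. All function names are hypothetical, and the farthest-point initialization and the histogram feature are choices made only for this illustration.

```python
from collections import Counter


def byte_ngrams(data, n=2):
    """Return the overlapping byte n-grams of `data` in file order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda j: sqdist(vec, centroids[j]))


def kmeans(vectors, k, iters=20):
    """Lloyd's K-means with deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector farthest from all centroids chosen so far.
        far = max(vectors, key=lambda v: min(sqdist(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def cluster_sequence(data, embeddings, centroids, n=2):
    """Replace each known n-gram by its cluster id, preserving file order."""
    return [nearest(embeddings[g], centroids)
            for g in byte_ngrams(data, n) if g in embeddings]


def cluster_histogram(sequence, k):
    """Fixed-length feature: relative frequency of each cluster id."""
    counts = Counter(sequence)
    total = len(sequence) or 1
    return [counts[j] / total for j in range(k)]


# Toy vectors standing in for Word2vec output (assumed values, not real data).
embeddings = {
    b"MZ": (0.0, 0.1),
    b"Z\x90": (0.1, 0.0),
    b"\x90\x00": (5.0, 5.1),
    b"\x00\x03": (5.1, 5.0),
}
centroids = kmeans(list(embeddings.values()), k=2)
sequence = cluster_sequence(b"MZ\x90\x00\x03", embeddings, centroids)
feature = cluster_histogram(sequence, k=2)
```

Mapping each n-gram to a cluster id shrinks the vocabulary from the number of distinct n-grams to `k` classes, which is the dimensionality reduction the abstract attributes to K-means. How the paper turns the ordered cluster ids into its final timing feature is not stated here, so the histogram is only one possible choice.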