| The threat that malicious software poses to network security has attracted much attention. Every day, millions of new malware samples appear, affecting thousands of users. Most of this malware, however, is derived from existing malware through modification or reuse. Attackers can design and reuse malware automatically, which keeps lowering the barrier to cybercrime. Rapidly classifying and archiving large numbers of malicious samples can accelerate both the detection of new malware and the tracking of version updates within the same malware family. A detection technology suited to the current, rapidly changing malware ecosystem is therefore urgently needed. This paper proposes a two-level fusion algorithm framework to predict the function and family tags of malware. It also analyzes successive versions of samples that belong to the same family and share the same function, and puts forward a malware version difference analysis model that better detects changes in program capability between two versions. The main work of this paper is as follows: (1) Because malware of the same family can reuse the original function modules, the paper first focuses on the task of malware family classification. A feature library is constructed from malware opcodes, visible strings, library function calls, PE sections, compilation information and the data directory table. The text classification model fastText is introduced into malware family recognition and its effect is verified. (2) Most malicious software on the Windows platform implements its behavior through Windows API calls. This paper therefore extracts both sequence features and statistical features of API functions. In extracting the API statistical features, the TextRank keyword extraction method is improved so that the more representative API functions for each tag are selected. (3) In the first-level model, LightGBM, CNN and fastText are used to model different features. In the second-level model, stacking-based multi-model fusion is used to further improve the accuracy and generalization ability of the model. The experimental results show that the accuracy of this method is 94.2%, and that it also performs well on other public datasets. (4) To further analyze the differences between versions of malware, a malware version difference analysis model is proposed. This model makes up for the deficiency of BinDiff in function matching, and examines changes in program capability between two versions by comparing differences in the function call process. |
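
The selection of representative API functions described in (2) can be illustrated with a minimal sketch of plain TextRank applied to API call sequences. This is not the improved TextRank variant proposed in the paper, whose details the abstract does not give. The function names `textrank_api` and `top_apis`, the co-occurrence window, the damping factor and the sample call sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def textrank_api(sequences, window=2, damping=0.85, iterations=50):
    """Score API names with plain (unmodified) TextRank.

    `sequences` is a list of API call sequences (lists of names).
    All parameter defaults are illustrative assumptions.
    """
    # Build an undirected weighted co-occurrence graph: two distinct APIs
    # are linked when one follows the other within `window` positions.
    weights = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1 : i + 1 + window]:
                if a != b:
                    weights[a][b] += 1.0
                    weights[b][a] += 1.0

    # APIs that never co-occur with a different API have no edges
    # and are therefore not scored.
    nodes = sorted(weights)
    if not nodes:
        return {}

    scores = {n: 1.0 for n in nodes}
    out_sum = {n: sum(weights[n].values()) for n in nodes}

    # Standard TextRank update, run for a fixed number of iterations.
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            rank = sum(weights[m][n] / out_sum[m] * scores[m] for m in weights[n])
            updated[n] = (1 - damping) + damping * rank
        scores = updated
    return scores


def top_apis(sequences, k=3, **kwargs):
    """Return the k highest-scoring API names (ties broken alphabetically)."""
    scores = textrank_api(sequences, **kwargs)
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


if __name__ == "__main__":
    # Hypothetical call sequences for samples sharing one tag.
    samples = [
        ["CreateFileW", "WriteFile", "CloseHandle"],
        ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
        ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"],
    ]
    print(top_apis(samples, k=3))
```

In such a setup, the call sequences of all samples carrying one tag would be ranked together, and the top-ranked API names could then serve as the vocabulary for that tag's statistical features.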