| In recent years, the rapid development of internet technology has led to the production and spread of large amounts of malware, which has severely impacted internet security. The diversity of malware and its variants has made traditional manual detection methods inadequate. Consequently, recent research has combined malware detection with machine learning and deep learning to construct efficient and accurate classifiers for malware classification and detection. However, traditional machine learning and deep learning methods have not performed well at selecting malware features, and the selected features are easily evaded by malware variants. Moreover, most well-performing malware classification models are opaque black-box models that cannot intuitively expose the reasons for their decisions, which makes their results unconvincing. Therefore, the main requirement in the field of malware classification is currently to make classification results interpretable while still building an efficient and accurate malware classifier. This thesis uses the dynamic API call sequences of running malware as features, combined with an API function semantic module, to construct an XLNet-based interpretable malware classification model that achieves excellent results on mainstream datasets. In addition, an interpretable malware classification system was developed and deployed for real-world environments. The specific work and innovations of this thesis are as follows: (1) Design and implementation of the API function semantic extraction module. This thesis selects the malware API call sequence as the feature because it captures the unique behavioural patterns of malware and is difficult to hide. In addition, a semantic extraction module for API function names is designed to split API function names into finer-grained segments. The Doc2Vec module extracts semantics from the segmented API
function phrases, and a mapping from functions to vectors is constructed, providing richer feature information for subsequent sequence processing. (2) Design and implementation of the XLNet-based classification module. This thesis uses XLNet, a self-attention model from the NLP field, as the sequence embedding module for API call sequences. XLNet incorporates the characteristics of autoregressive language models to address the shortcomings of autoencoding language models. XLNet places no fixed limit on the length of API sequences, preserving rich sequential behavioural information. In addition, this thesis reduces the parameter count and depth of XLNet, using a smaller parameter size and fewer stacked layers to achieve excellent classification results, with better training time and accuracy than mainstream recurrent neural network and Transformer models. The F1 score on the public Catak dataset was increased to 0.65, and the AUC score was increased to 0.903. The reduced model also runs better on computers with limited resources. (3) Implementation of interpretability for malware classification results. Building on a high-precision malware classifier, this thesis uses the attention mechanism of the XLNet architecture to set up a complete interpretability system for malware classification, which can explain the model's classification results. The attention weight matrix is extracted from XLNet, and the degree of influence of each API function in the sequence on the sentence vector is calculated. Malicious subsequences are extracted according to the degree of influence. Researchers can quickly locate malicious modules through the API call sequences and conduct targeted analysis. The interpretability mechanism can significantly improve the efficiency of researchers’ work, enhance the accuracy of analysis, and improve the credibility of the analysis results. (4) Design and development of the MalNet malware classification system. This thesis carried out a detailed
requirements analysis and architecture design for the system, focusing on the malware execution function, the API call sequence extraction and restoration function, and the analysis result output function. The entire system was developed and tested on this basis. The system can be combined with the open-source sandbox Cuckoo to analyse malware comprehensively. This thesis has completed the development and practical testing of the MalNet interpretable malware classification system, which achieves efficient and accurate malware classification and provides credible explanations for its classification results. The system is highly portable, has low hardware requirements, helps address the problem of malware proliferation, and assists security personnel in further analysis, achieving the expected goals. |
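
Two of the steps described above, API name segmentation in (1) and influence-based subsequence extraction in (3), can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names `segment_api_name`, `token_influence`, and `top_window`, the regular expression, the toy attention matrix, and the choice of the column mean as the "degree of influence" are all assumptions made for illustration. The Doc2Vec embedding and the XLNet model itself are not reproduced here.

```python
import re

# Assumed segmentation rule: split on CamelCase boundaries, keep acronym
# runs (e.g. "WSA") together, and treat digit runs as their own segment.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def segment_api_name(name):
    """Split an API function name into lowercase, word-like segments."""
    return [tok.lower() for tok in _TOKEN.findall(name)]


def token_influence(attention):
    """Average attention each position receives.

    `attention` is a square list of lists where row i holds the attention
    weights that position i assigns to every position. Using the column
    mean as the influence score is an assumption of this sketch.
    """
    n = len(attention)
    return [sum(row[j] for row in attention) / n for j in range(n)]


def top_window(calls, influence, width):
    """Return the contiguous run of `width` calls with the largest summed influence."""
    if not 0 < width <= len(calls):
        raise ValueError("width must be between 1 and len(calls)")
    best_start = max(
        range(len(calls) - width + 1),
        key=lambda start: sum(influence[start:start + width]),
    )
    return calls[best_start:best_start + width]


# Toy data, invented for illustration only.
calls = ["LdrLoadDll", "NtCreateFile", "NtWriteFile", "NtClose"]
attention = [
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.2, 0.3, 0.3, 0.2],
]

if __name__ == "__main__":
    for call in calls:
        print(call, segment_api_name(call))
    scores = token_influence(attention)
    print(top_window(calls, scores, width=2))
```

In a real pipeline the segments would be passed to an embedding model, and the attention matrix would be taken from the trained classifier rather than written by hand. A fixed-width window is only one way to pick a subsequence; thresholding the influence scores would be another.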