Research On Cross-programming Language Malicious Code Detection

Posted on: 2024-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Xu
Full Text: PDF
GTID: 2568307100961069
Subject: Computer technology
Abstract/Summary:
The rapid development of open-source communities and software supply chains has promoted extensive code reuse, which is convenient for developers but also poses serious security risks. A growing number of attackers embed malicious code in open-source projects, threatening the computing devices that developers use. Security analysis that detects malicious code in open-source code is therefore important for protecting the open-source community and the software supply chain ecosystem.

Existing malicious code detection methods for source code usually build recognition models from samples of a single programming language. Because open-source communities and software supply chains contain massive amounts of code written in different languages, a separate recognition model is needed for each language, which makes detection costly. This thesis therefore aims to build a malicious code detection model with cross-programming language capabilities, and it addresses two main challenges: (1) for multi-language malicious code with significant differences in syntactic structure, how to capture the features the languages share in the detection model; (2) for the class imbalance in the available malicious code training samples, how to improve the generalization of the model on minority classes.

The main research work of this thesis is as follows:

(1) We propose a cross-programming language malicious code detection method called Cross ASTMD, based on Abstract Syntax Trees (AST). The method first converts the source code into an AST to weaken the syntactic and structural differences between code in different languages. A Tree-Based Convolutional Neural Network (TBCNN) then performs representation learning on the AST, capturing the hierarchical structure and the dependencies between AST nodes. Dynamic pooling extracts language-independent features and builds a consistent representation for code in different languages, and this representation is fed to the classification layer to determine whether the code is malicious. Experimental results show that Cross ASTMD effectively learns features shared by malicious code in multiple languages and achieves high accuracy and efficiency in cross-programming language malicious code detection tasks. Its Accuracy, Precision, Recall, and F1 scores reach 97.74%, 94.73%, 96.42%, and 95.56%, respectively, all better than the baseline methods.

(2) We propose Cross LMD, a cross-programming language malicious code detection method based on pre-trained models. The method first converts source code into token sequences and uses a model pre-trained on large-scale multi-language code data to learn vector representations of those sequences. A cross-language detection model is then built on this representation using multi-language malicious code samples. By leveraging large-scale pre-trained models, Cross LMD better captures semantic relationships between code tokens in different languages and mitigates the impact of class imbalance. Experimental results show that Cross LMD achieves 99.98%, 99.97%, 99.98%, and 99.98% in Accuracy, Precision, Recall, and F1, respectively, all better than the baseline method. It also has a significant advantage in detecting malicious code in languages for which no training samples are available, demonstrating good generalizability.

The research in this thesis provides new ideas for constructing cross-programming language malicious code detection models, which will benefit the secure development of open-source communities and software supply chain ecosystems.
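The role of dynamic pooling in an AST-based pipeline can be illustrated with a small sketch. This is not the thesis's implementation: it omits the tree-based convolution entirely, handles only Python source through the standard `ast` module, and replaces learned node embeddings with a hash-derived stand-in. The names `DIM`, `node_vector`, `node_types`, `dynamic_max_pool`, and `encode` are hypothetical. It shows only that element-wise max pooling over all node vectors yields a fixed-length representation regardless of how large the tree is.

```python
import ast
import hashlib

DIM = 8  # hypothetical embedding width, chosen only for illustration


def node_vector(type_name, dim=DIM):
    """Deterministic stand-in for a learned embedding of an AST node type."""
    digest = hashlib.sha256(type_name.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]  # values in [0, 1]


def node_types(source):
    """Parse Python source and return the type name of every AST node."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def dynamic_max_pool(vectors, dim=DIM):
    """Element-wise max over any number of node vectors.

    The output length is always `dim`, so trees of different sizes
    map to representations of the same shape.
    """
    pooled = [0.0] * dim  # safe start value because all inputs are >= 0
    for vec in vectors:
        pooled = [max(p, v) for p, v in zip(pooled, vec)]
    return pooled


def encode(source):
    """Source text -> fixed-length vector (no convolution step in this sketch)."""
    return dynamic_max_pool([node_vector(t) for t in node_types(source)])


small = encode("x = 1")
large = encode("import os\nfor f in os.listdir('.'):\n    print(f)\n")
print(len(small), len(large))  # prints "8 8"
```

A cross-language version would need a parser for each language and a shared node-type vocabulary, and the pooled vector would feed a trained classifier rather than being used directly.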
Keywords/Search Tags: malicious code detection, cross-programming language, abstract syntax tree, pre-trained model