Research On Cross-programming Language Malicious Code Detection

Posted on:2024-01-25

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Xu

Full Text:PDF

GTID:2568307100961069

Subject:Computer technology

Abstract/Summary:

The rapid development of open-source communities and software supply chains has made code highly reusable, which is a great convenience for developers but also a serious security risk. A growing number of attackers embed malicious code in open-source projects, threatening the computing devices that developers use. Security analysis of open-source code to detect malicious code is therefore important for protecting the open-source community and the software supply chain ecosystem. Existing malicious code detection methods for source code usually build recognition models from samples of a single programming language. Because open-source communities and software supply chains contain massive amounts of code written in different languages, a separate recognition model is needed for each language, which makes detection costly. This thesis therefore aims to build a malicious code detection model with cross-programming language capability, and addresses two main challenges: (1) how the detection model can capture the features common to malicious code in multiple languages whose syntactic structures differ significantly; (2) how to improve the model's generalization to minority classes, given the class imbalance in the available malicious code training samples.

The main research work of this thesis is as follows:

(1) We propose a cross-programming language malicious code detection method called Cross ASTMD, based on Abstract Syntax Trees (AST). The method first converts the source code into an AST to weaken the syntactic and structural differences between code in different languages. A Tree-Based Convolutional Neural Network (TBCNN) then performs representation learning on the AST, capturing the hierarchical structure of the tree and the dependencies between its nodes. Dynamic pooling extracts language-independent features and yields a consistent representation for code in different languages; this representation is the input to the classification layer, which determines whether the code is malicious. Experimental results show that Cross ASTMD effectively learns the features common to malicious code in multiple languages and achieves high accuracy and efficiency in cross-programming language malicious code detection. Its Accuracy, Precision, Recall, and F1 scores are better than those of the baseline methods, reaching 97.74%, 94.73%, 96.42%, and 95.56%, respectively.

(2) We propose a cross-programming language malicious code detection method based on pre-trained models, called Cross LMD. The method first converts source code into token sequences and uses a model pre-trained on large-scale multi-language code data to learn vector representations of those sequences. A cross-language detection model is then built on this representation using multi-language malicious code samples. By leveraging a large-scale pre-trained model, Cross LMD better captures the semantic relationships between code tokens in different languages and mitigates the impact of class imbalance. Experimental results show that Cross LMD achieves 99.98%, 99.97%, 99.98%, and 99.98% in Accuracy, Precision, Recall, and F1, respectively, all better than the baseline method. It also has a significant advantage in detecting malicious code in languages for which there are no training samples, demonstrating good generalizability.

The research in this thesis provides new ideas for constructing cross-programming language malicious code detection models, which will have a positive impact on the secure development of open-source communities and software supply chain ecosystems.
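
The role of pooling in the AST-based method can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses Python's standard `ast` module (so it parses Python source only, whereas the thesis targets multiple languages), it replaces the TBCNN's learned convolution with a hash-derived stand-in vector per node type, and the names `DIM`, `node_vector`, `node_types`, `dynamic_max_pool`, and `represent` are all hypothetical. It shows only why an element-wise max over all node vectors gives a fixed-size representation regardless of how large the tree is.

```python
import ast
import hashlib

DIM = 8  # hypothetical feature width, chosen only for illustration


def node_vector(type_name: str) -> list[float]:
    """Deterministic stand-in for a learned node embedding (assumption)."""
    digest = hashlib.sha256(type_name.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def node_types(source: str) -> list[str]:
    """Parse Python source and return the type name of every AST node."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def dynamic_max_pool(vectors: list[list[float]]) -> list[float]:
    """Element-wise max over any number of node vectors, giving one vector."""
    return [max(column) for column in zip(*vectors)]


def represent(source: str) -> list[float]:
    """Map source code of any size to a single DIM-wide feature vector."""
    return dynamic_max_pool([node_vector(t) for t in node_types(source)])


# Two snippets with different tree sizes yield vectors of the same length.
small = represent("x = 1")
large = represent("import os\nfor f in os.listdir('.'):\n    print(f)\n")
print(len(small), len(large))  # prints: 8 8
```

In a real pipeline the pooled vector would feed a trained classification layer, and a multi-language parser would be needed in place of `ast`; both are outside the scope of this sketch.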

Keywords/Search Tags:

malicious code detection, cross-programming language, abstract syntax tree, pre-trained model

Related items

1	Research On Source Code Plagiarism Detection Based On Abstract Syntax Tree
2	Research And Application Of Automatic Scoring Scheme For C Programming Problems Based On Abstract Syntax Tree
3	Design And Implementation Of Abstract Syntax Tree Based Code Defect Detection
4	A Research On Program Coding-oriented Plagiarism Detection Techniques By AST-based Strategy
5	Automatically Based On The Abstract Syntax Tree And Static Analysis Of The Cloned Code Refactoring
6	Development Of Static Code Defect Detection Tool Based On Abstract Syntax Tree
7	Research And Implementation Of Malicious JavaScript Code Detection System For Applet Based On Deep Learning
8	Research And Design Of Source Code Homology Detection System Based On Text And Abstract Syntax Tree Compare
9	Optimization Of Deep Code Repair Model Based On Grammar Rules
10	The Duplicate Code Detection Based On AST