
Research On Software Vulnerability Detection Based On Transferable Code Data-Driven Language Model

Posted on: 2022-12-09  Degree: Master  Type: Thesis
Country: China  Candidate: P Zeng
GTID: 2518306785958189  Subject: Automation Technology
Abstract/Summary:
With the advent of the Internet era, software is used more and more widely. During development, vulnerabilities may be introduced through developer negligence or through limitations of the software itself. Criminals can exploit these vulnerabilities to cause user data leakage and property damage. Finding existing vulnerabilities in source code as early as possible and fixing them promptly can therefore greatly reduce the harm they cause.

Many researchers work on software vulnerability detection. In the early stage of this research, most methods were based on machine learning. However, traditional machine learning-based methods usually require experts to define features explicitly. In most cases, manually defined features are subjective, task-specific, and error-prone, and their quality is limited by the experience and knowledge of the practitioner. To extract high-quality features and relieve experts of manual feature extraction, deep learning was introduced into vulnerability detection. Deep learning extracts features automatically, and the resulting feature representations usually generalize better than manually extracted ones.

Most deep learning-based vulnerability detection methods, however, use traditional neural networks such as Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN). A CNN is suited only to extracting local features and is not effective at extracting long-distance dependencies. A BiLSTM can capture some long-distance dependencies, but some vulnerable functions are longer than the range over which these networks can capture them, which limits detection performance.

To address the shortcomings of BiLSTM and CNN in vulnerability detection scenarios, this paper proposes a new vulnerability detection framework that uses advanced neural embeddings. The framework is based on the code data-driven CodeBERT language model, a large-scale pre-trained embedding model for natural and programming languages. CodeBERT represents the latest research direction in a variety of natural language processing and code analysis tasks, has a stronger potential to understand code semantics than traditional models, and demonstrates higher generalization ability. The specific work of this paper is as follows:

(1) First, the proposed framework encapsulates the CodeBERT model as a code representation generator, applies it to the vulnerability detection scenario, and compares it with several existing methods. CodeBERT stacks the encoder layers of a multi-layer bidirectional Transformer, so it has a multi-head self-attention mechanism that can effectively capture long-distance feature dependencies in vulnerability data.

(2) Second, to further demonstrate the detection ability of the framework, this paper combines it with transfer learning for cross-project vulnerability detection and compares it experimentally with several existing methods. To address the lack of a code embedding model for C source code, this paper extracts knowledge from C source code and fine-tunes the pre-trained embedding model to better support the detection of vulnerable C functions in open-source projects. To address the serious data imbalance found in real scenarios, the idea of code demonstration is introduced, and a large amount of synthetic vulnerability data is used to further improve the robustness of the detection method.

The experimental results show that the proposed vulnerability detection framework outperforms the comparison methods.
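The claim that self-attention captures long-distance dependencies can be illustrated with a minimal sketch. This is not the thesis's implementation and not CodeBERT itself: it assumes a single attention head with identity query/key/value projections (real Transformer encoders use learned projections and several heads), and the toy vectors and the function name `self_attention` are hypothetical. It shows only that the attention weight between two positions depends on their content, not on how far apart they are.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: queries, keys, and values are the input
    vectors themselves (identity projections).
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the query against every position, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, vectors)) for i in range(dim)]
        )
    return outputs, weights


# Hypothetical toy sequence: positions 0 and 5 carry related content
# (for example, a buffer allocation and its later use); positions 1-4 are filler.
tokens = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [4.0, 0.0]]
outputs, weights = self_attention(tokens)

# Position 0 attends to the distant position 5 far more than to its neighbor.
print(weights[0][5] > weights[0][1])
```

In this sketch the first position gives the last position the same weight it gives itself, even though four unrelated positions lie between them. A recurrent model would instead have to carry that information through every intermediate step.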
Keywords/Search Tags: Data-driven, CodeBERT, transfer learning, vulnerability detection