Research On Software Vulnerability Detection Based On Transferable Code Data-Driven Language Model

Posted on:2022-12-09

Degree:Master

Type:Thesis

Country:China

Candidate:P Zeng

Full Text:PDF

GTID:2518306785958189

Subject:Automation Technology

Abstract/Summary:

With the advent of the Internet era, software is used ever more widely. During development, vulnerabilities may be introduced through developer negligence or through limitations of the software itself. Criminals can exploit these vulnerabilities, causing user data leakage and property damage. Finding existing vulnerabilities in source code as early as possible and fixing them promptly can therefore greatly reduce the harm they cause.

Many researchers work on software vulnerability detection. Early methods were mostly based on machine learning. However, traditional machine learning-based methods usually require experts to define features explicitly. In most cases, manually defined features can be subjective, task-specific, and error-prone, and their quality is limited by the experience and knowledge of the practitioner. To extract high-quality features and relieve experts of the burden of manual feature extraction, deep learning was introduced into vulnerability detection. Deep learning extracts features automatically, and the resulting feature representations usually generalize better than manually extracted ones.

However, most deep learning-based vulnerability detection methods use traditional neural networks, such as Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN). CNN is suited only to extracting local features and is not effective at extracting long-distance dependencies. Although BiLSTM can capture some long-distance dependencies, some vulnerable functions are longer than the range over which it can do so, which limits detection performance.

To address the shortcomings of BiLSTM and CNN in vulnerability detection scenarios, this thesis proposes a new vulnerability detection framework using advanced neural embeddings. The framework is based on the code data-driven CodeBERT language model, a large-scale pre-trained embedding model for natural and programming languages. CodeBERT reflects the latest research direction in a range of natural language processing and code analysis tasks, has stronger potential to understand code semantics than traditional models, and demonstrates higher generalization ability. The specific work of this thesis is as follows:

(1) First, the proposed framework encapsulates the CodeBERT model as a code representation generator, applies it to vulnerability detection, and compares it with several existing methods. CodeBERT is built by stacking the encoder layers of a multi-layer bidirectional Transformer, so it has a multi-head self-attention mechanism that can effectively extract long-distance feature dependencies in vulnerability data.

(2) Second, to further demonstrate the framework's detection ability, this thesis combines it with transfer learning for cross-project vulnerability detection and compares it experimentally with several existing methods. To address the lack of a code embedding model for C source code, the thesis extracts knowledge from C source code and fine-tunes the pre-trained embedding model to better support the detection of vulnerable C functions in open source projects. To address the severe data imbalance found in real-world scenarios, the idea of code demonstration is introduced, and a large amount of synthetic vulnerability data is used to further improve the robustness of the detection method.

The experimental results show that the detection performance of the proposed framework is better than that of the comparison methods.
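
The abstract's point about self-attention and long-distance dependencies can be illustrated with a minimal sketch. This is not the thesis's implementation and does not use CodeBERT itself: it is a single-head, pure-Python scaled dot-product self-attention over made-up 2-d token vectors, with identity projections assumed in place of learned ones. The names `self_attention`, `ALLOC`, `FILLER`, and `USE` are hypothetical. It shows only that the attention weight between two related tokens does not depend on how many unrelated tokens separate them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the input vectors themselves
    (identity projections); a real Transformer learns separate projections.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in vectors:
        # Score the query against every position, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, vectors)) for d in range(dim)]
        )
    return outputs, weights


# Toy "function": an allocation token first, a related use token last,
# and 50 unrelated filler tokens in between (hypothetical embeddings).
ALLOC = [4.0, 0.0]
FILLER = [0.0, 1.0]
USE = [4.0, 0.0]
tokens = [ALLOC] + [FILLER] * 50 + [USE]

outputs, weights = self_attention(tokens)
last = weights[-1]  # attention distribution of the final (use) token
print(f"weight on position 0 (51 tokens away): {last[0]:.4f}")
print(f"weight on an adjacent filler token:    {last[-2]:.6f}")
```

In this toy setup the final token places far more weight on the distant allocation token than on any neighbouring filler token, because the score depends on vector similarity rather than position. A recurrent model, by contrast, has to carry that information through every intermediate step.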

Keywords/Search Tags:

Data-driven, CodeBERT, transfer learning, vulnerability detection

Related items

1	The Research Of Automated Host Vulnerability Process Driven By Data
2	Research Of Data-driven Vulnerability Detection Technology
3	Data-Driven Android Malware Detection And Android App Vulnerability Discovery
4	Research On Software Vulnerability Prediction Method Based On Deep Transfer Learning
5	Research On Data-driven For Virtual Character Motion Style Transfer
6	Research On API Documentation Mining Driven By Software Knowledge And Data
7	Research On Software Buffer Overflow Vulnerability Detection Method Based On Deep Learning
8	An Automatic Vulnerability Data Collection And Processing System For Open-Source Software
9	Research On Vulnerability Detection Method Based On Deep Learning
10	Program Vulnerability Detection Through Learning On Code Text And Control Structure