Research On Key Technologies Of Source Code Vulnerability Static Detection Based On Machine Learning

Posted on: 2023-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Li
GTID: 1528306914976669
Subject: Computer Science and Technology
Abstract/Summary:
Computer software is playing an increasingly important role in economic and social development. Unfortunately, software security is seriously compromised by vulnerabilities. Once software vulnerabilities are discovered and exploited by attackers, information systems suffer from information leakage, service interruption, and data tampering. Given the frequency of cyber attacks in recent years, software vulnerabilities have become an important factor affecting cyberspace security and economic development. Vulnerability detection is a key method for improving software quality and ensuring secure, smooth operation, and it is an important academic research direction. Traditional vulnerability detection techniques, such as fuzzing, symbolic execution, and taint analysis, are inefficient and rely on manual participation. Recent research shows that machine learning has unique advantages in alleviating the automation and efficiency bottlenecks of traditional methods. Machine learning based vulnerability detection is therefore attracting growing attention in academia and industry and has become an active research direction in software security.

Although existing machine learning methods have made remarkable progress in vulnerability detection, several issues remain. Firstly, the strong syntactic structure, variable vocabulary, noise, and other characteristics of programming languages affect vulnerability representation learning. Secondly, coarse detection granularity limits the effectiveness and practicality of vulnerability detection. Finally, when training samples are limited, the vulnerability detection model is affected by the feature distribution of the domain in which the training data resides. Bias towards the source domain distribution significantly degrades the performance of the model in new scenarios and on new data. To address these challenges, this dissertation studies machine learning based vulnerability detection and focuses on three core research points: a vulnerability detection framework using minimum intermediate representation, a hybrid network for fine-grained vulnerability detection and interpretation, and a framework that combines graph embedding and domain adaptation for cross-domain vulnerability detection. Details of the research are given below.

(1) To address long-term dependency, variable vocabulary, and limited labeled samples in vulnerability representation learning, a software vulnerability detection framework based on minimum intermediate representation is proposed. Firstly, the vulnerability sample is transformed into a minimum intermediate representation at the source code level. In this process, only vulnerability-related parts are retained, which alleviates the long-term dependency issue and noise interference. Secondly, word embeddings are pre-trained on an unlabeled extended corpus to alleviate the scarcity of labeled samples. Finally, multiple convolutional neural networks are used to learn high-level vulnerability features and train the classifier. Experimental results show that the minimum intermediate representation, unsupervised pre-training, and convolutional neural networks each promote the representation learning of vulnerability features. The framework achieves a significant improvement over existing methods.

(2) To address coarse detection granularity, which affects the accuracy and practicality of vulnerability detection, a vulnerability detection and interpretation model based on a hybrid neural network is proposed. A vulnerability description based on security-sensitive operations is introduced. According to this description, a fine-grained vulnerability intermediate representation is generated using the context flow graph in static single assignment form. A hybrid neural network learns vulnerability features from the intermediate representations and constructs a classifier. The convolutional layer and the recurrent layer learn local and global vulnerability features, respectively. According to the vulnerability description, detection results at the intermediate representation level are traced back to the source code. Vulnerability interpretations in graph form are then generated to assist in understanding and patching vulnerabilities. Experimental results show that the vulnerability intermediate representation based on security-sensitive operations is accurate and concise, and that the hybrid neural network can learn comprehensive vulnerability features. The proposed model outperforms existing methods in both detection and interpretation.

(3) To address the performance degradation of a machine learning model when it is applied to a completely new project, a cross-domain vulnerability detection model based on graph embedding and domain adaptation is proposed. The model uses a graph-based representation learning method to obtain the initial feature representation of samples. Specifically, samples are first transformed into code property graphs. Syntactic and semantic information is then aggregated from neighbor nodes and edges to eliminate the long-term dependency issue. The initial feature representation of a sample is generated by aggregating all the nodes of the graph. The model further uses deep domain adaptation to learn a feature transformation from labeled data in the source domain and unlabeled data in the target domain. This feature transformation maps samples in the source domain and the target domain to a common data distribution while remaining discriminative for vulnerability detection. Experimental results show that the graph embedding method can learn comprehensive vulnerability characteristics and promote the classification of vulnerability samples. After domain adaptation, the proposed model achieves better detection performance than existing methods in cross-domain vulnerability detection tasks.
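
The neighbor aggregation and whole-graph readout described in (3) can be illustrated with a minimal sketch. It is not the dissertation's model: it assumes an unweighted mean over undirected neighbors and a mean readout, since the abstract does not state the aggregation function. It has no learned parameters. The node names, feature vectors, and edge labels are hypothetical, and edge labels are carried but unused here, whereas the described model also aggregates information from edges.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str, str]  # (source node, target node, edge label)


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def aggregate_once(features: Dict[str, Vector], edges: List[Edge]) -> Dict[str, Vector]:
    """One aggregation round: each node averages itself with its neighbors."""
    neighbors: Dict[str, List[str]] = {node: [] for node in features}
    for source, target, _label in edges:  # assumption: the edge label is ignored
        neighbors[source].append(target)
        neighbors[target].append(source)
    return {
        node: mean([vector] + [features[n] for n in neighbors[node]])
        for node, vector in features.items()
    }


def graph_embedding(features: Dict[str, Vector], edges: List[Edge], rounds: int = 2) -> Vector:
    """Run several aggregation rounds, then average all nodes into one vector."""
    for _ in range(rounds):
        features = aggregate_once(features, edges)
    return mean(list(features.values()))


# Hypothetical three-node graph standing in for a code property graph.
toy_features = {"decl": [1.0, 0.0], "call": [0.0, 1.0], "arg": [0.0, 0.0]}
toy_edges = [("decl", "call", "data_flow"), ("call", "arg", "ast")]
toy_embedding = graph_embedding(toy_features, toy_edges, rounds=2)
```

Each extra round lets a node receive information from one hop further away, which is how graph aggregation can relate statements that lie far apart in the token sequence. The resulting fixed-length vector is the kind of per-sample representation that a later domain adaptation step would transform.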
Keywords/Search Tags: vulnerability detection, programming language processing, representation learning, graph embedding, domain adaptation