
Semantic Understanding Of Vulnerability Source Code Based On Representation Learning

Posted on: 2021-05-13  Degree: Master  Type: Thesis
Country: China  Candidate: S D Bai
GTID: 2428330602495152  Subject: Engineering
Abstract/Summary:
With the continuous development of software technology, software security has become an issue of wide public concern. According to NIST statistics, 90% of serious incidents caused by software security problems stem from software security vulnerabilities. Current research on software security vulnerability detection, both in China and abroad, focuses mainly on source code. Source code is a human-oriented language that is difficult for machines to understand. Vulnerable source code is highly abstract and complex, and most of its properties are hard to define concisely in machine-readable form. A good understanding of code semantics usually helps improve the accuracy of machine learning models; conversely, insufficient semantic understanding leads to low detection accuracy. This is one of the main problems facing vulnerability detection research today, so in-depth research centered on code semantic understanding has become an urgent need. This thesis studies vulnerability semantic understanding based on representation learning. The goal is to provide reliable support for large-scale vulnerability pattern mining and semantics-based vulnerability detection, to improve the accuracy of software vulnerability analysis and detection, and to reduce the false negative and false positive rates. The research consists of three parts:

(1) Semantic annotation of vulnerabilities. To address the lack of labeled data sets in vulnerability detection, this thesis combines existing code labeling techniques with the specific needs of this research to design an overall semantic annotation scheme, and proposes three specific annotation methods. The method, process, and results of labeling examples are described in detail. Finally, through analysis and comparison, the annotation method that is systematically structured, rich in annotation information, and most relevant to this research is selected for annotating the vulnerability source code.

(2) Deep abstract syntax tree representation. To address the problem that an ordinary abstract syntax tree (AST) captures only shallow features of code semantics, this thesis proposes a deep AST representation combined with vulnerability semantic annotation. The AST is widely used in code analysis and processing, and it can provide the static structure, data flow, and control flow information of the code. Based on the characteristics of the AST, this thesis uses depth-first traversal, number mapping, padding, and word embedding to extract deeper features from the ordinary AST. It also describes in detail the method and process of constructing vulnerability semantic vector samples by combining this representation with vulnerability semantic annotation.

(3) Vulnerability detection based on representation learning. Building on the work above, this thesis proposes a vulnerability detection scheme that combines learning with pattern matching, and presents the general idea of the scheme together with the design and implementation of the algorithm and the neural network. First, the labeled samples are used as input to the neural network, which is trained to learn the vulnerability features. The network is a bidirectional LSTM (Bi-LSTM), a type of recurrent neural network. Because contextual information is critical to vulnerability semantics, Bi-LSTM is used to capture longer dependencies: it can detect long-range dependencies in the code in both the forward and backward directions, and so can effectively capture the feature representation of a vulnerability. The unlabeled samples are then fed to the trained network to learn a subset of the feature representations. Next, a classifier algorithm is run on the labeled samples to generate a classification model, and the model is applied to the unlabeled samples to complete the first-stage vulnerability classification prediction. Finally, pattern matching is applied to the first-stage prediction results to obtain the second-stage prediction results and complete the vulnerability detection.

The experimental results show that, compared with traditional code-metric methods, the proposed method, which combines semantic annotation with the deep AST representation, achieves better vulnerability detection accuracy. This indicates that the method better represents the characteristics of vulnerable source code and improves the computer's ability to understand its semantics. The thesis concludes by summarizing the research and proposing directions for future improvement.
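
The traversal, number mapping, and padding steps named in part (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses Python's standard `ast` module as a stand-in parser (the abstract does not say which language or parser was used), and the names `node_type_sequence`, `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical. The word embedding step, which would map each integer id to a dense vector, is omitted.

```python
import ast
from typing import Dict, List

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for node types not seen when building the vocabulary


def node_type_sequence(source: str) -> List[str]:
    """Return AST node type names in depth-first (pre-order) order."""
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sequence


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Number mapping: assign each distinct node type an integer id >= 2."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in seq:
            if token not in vocab:
                vocab[token] = len(vocab) + 2
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(tok, UNK_ID) for tok in seq][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Two toy code samples; real inputs would be annotated vulnerability code.
samples = ["x = a + b", "print(x)"]
sequences = [node_type_sequence(s) for s in samples]
vocab = build_vocab(sequences)
vectors = [encode(seq, vocab, max_len=12) for seq in sequences]
```

Each entry of `vectors` is a fixed-length integer sequence, which is the form an embedding layer followed by a sequence model such as a Bi-LSTM would typically take as input.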
Keywords/Search Tags: Security vulnerability, Representation learning, Abstract syntax tree, Semantic annotation, Semantic understanding