Research On Vulnerability Mining Based On Deep Learning And Programming Language Processing

Posted on: 2021-10-16 | Degree: Master | Type: Thesis
Country: China | Candidate: H T Feng | Full Text: PDF
GTID: 2518306047488154 | Subject: Master of Engineering
Abstract/Summary:
With the rapid development of software technology, the number of vulnerabilities is also growing quickly. This brings huge risks to cyberspace security and makes vulnerability detection an essential topic of security research. Traditional vulnerability detection relies on manual program analysis by security researchers or on predefined rules, neither of which can keep up with the rapidly growing amount of software. Some methods already automate vulnerability detection with machine learning or deep learning. However, these methods only work at the function level or code block level, predicting whether a given program contains vulnerabilities. This makes the detection results less interpretable and hard to act on, and it also leads to low accuracy and a high false positive rate. In addition, existing solutions often treat program source code as plain text, or extract features from a code property graph based on dynamic analysis, which produces a large amount of redundant data and adds computational overhead. To overcome these deficiencies, a deep learning-based vulnerability detection model is proposed to achieve interpretable and fine-grained vulnerability detection. The innovation and main work of this paper are as follows:

(1) The model applies a hierarchical attention network to source code-based vulnerability detection and consists of two levels of recurrent attention layers. Each recurrent attention layer uses a Bi-GRU layer as an encoder to extract the hidden representation of the input vectors, and an attention layer records the attention weight of each vector. The attention mechanism is applied at both the line level and the token level of the code to locate the key features of vulnerabilities. With the hierarchical attention network, the model can distinguish how important different lines and different syntax elements are to a vulnerability, which allows it to locate vulnerability-related information more accurately and to achieve fine-grained vulnerability detection. By analyzing the attention weights, the key features of vulnerabilities can be extracted directly, which makes the model interpretable. In vulnerability detection work, this provides key information for further vulnerability exploitation and repair. In addition, the pack padded method is applied to both the line-level and the token-level recurrent layers, which avoids the data loss caused by truncating and padding the final vectors and improves the stability and accuracy of the model.

(2) Program source code is structured text with grammatical meaning, and it must be transformed into a suitable vector representation before a neural network can learn from it. This paper proposes a program processing method based on the abstract syntax tree that transforms program source code into its corresponding vector representation while retaining complete syntax information and reducing redundancy. The method first parses the program source code into an abstract syntax tree, extracts the corresponding function nodes, and maps user-defined variable names and function names to a unified representation. It then extracts all the tokens of the program to train a word embedding model and maps every token to its corresponding vector representation. Finally, the extracted token sequence is rearranged according to the program segmentation symbols to obtain a hierarchical program vector representation. This programming language processing method is easy to parallelize, and it fully retains the syntax and semantic information of the program source code.

(3) The evaluation experiments use two widely used benchmark datasets derived from SARD: CWE-119 (Buffer Error) and CWE-399 (Resource Management Error). Compared with other vulnerability detection tools and models, the experimental results show that the model in this paper is more effective than the state-of-the-art methods: its F1 score reaches 85.1% (CWE-119) and 90.0% (CWE-399) on the two benchmark datasets. In particular, the model can directly mark the importance of different lines and different tokens of the program source code to a vulnerability, which makes it interpretable and provides useful information for vulnerability exploitation and repair.
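
The preprocessing described in (2) can be illustrated with a rough sketch. This is not the thesis implementation: the thesis parses an abstract syntax tree, whereas the sketch below substitutes a simple regular-expression tokenizer for C-like code. The placeholder names FUN1/VAR1, the keyword and library-call sets, the choice of `;`, `{` and `}` as segmentation symbols, and the function name `normalize` are all assumptions made for illustration. The word embedding step is omitted.

```python
import re

# Hypothetical, deliberately small keyword set for a C-like language.
KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return", "sizeof"}
# Assumption: library calls are kept verbatim rather than renamed.
LIBRARY_CALLS = {"strcpy", "strncpy", "memcpy", "malloc", "free", "printf"}
# Assumption: these symbols close one "line" of the hierarchical representation.
SEGMENT_SYMBOLS = {";", "{", "}"}

# Identifiers, integer literals, or any single non-space character.
# Limitation: multi-character operators and string literals are split apart.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalize(source):
    """Return a list of lines, each a list of tokens with unified names."""
    tokens = TOKEN_RE.findall(source)
    names = {}                       # original identifier -> placeholder
    counters = {"FUN": 0, "VAR": 0}
    lines, current = [], []
    for i, tok in enumerate(tokens):
        is_user_name = (
            IDENT_RE.fullmatch(tok)
            and tok not in KEYWORDS
            and tok not in LIBRARY_CALLS
        )
        if is_user_name:
            if tok not in names:
                # Heuristic: an identifier followed by "(" is a function name.
                followed_by_paren = i + 1 < len(tokens) and tokens[i + 1] == "("
                kind = "FUN" if followed_by_paren else "VAR"
                counters[kind] += 1
                names[tok] = f"{kind}{counters[kind]}"
            tok = names[tok]
        current.append(tok)
        if tok in SEGMENT_SYMBOLS:   # close the current line
            lines.append(current)
            current = []
    if current:                      # trailing tokens without a closing symbol
        lines.append(current)
    return lines


SAMPLE = "void copy(char *src) { char buf[10]; strcpy(buf, src); }"

for line in normalize(SAMPLE):
    print(" ".join(line))
```

Each inner list corresponds to one input row for a token-level encoder, and the outer list to the sequence seen by a line-level encoder. A real pipeline would take identifier roles from the parser instead of the "followed by a parenthesis" heuristic used here.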
Keywords/Search Tags: Vulnerability Detection, Program Language Processing, Deep Learning, Attention Mechanism