
Web Application Client Input Constraint Detection Based On BERT Pre-Training Model

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Fu
GTID: 2518306551971149
Subject: Master of Engineering
Abstract/Summary:
In the Internet era, Web applications are developing rapidly and have become core business in many fields. They are an essential carrier for information sharing and resource acquisition, and their security and reliability have become critical concerns for many companies and researchers. Interaction in Web applications usually relies heavily on user input, and programmers' lack of experience or security awareness leads to input constraint vulnerabilities. Such vulnerabilities often expose applications to network attacks, causing information leakage, system damage and other immeasurable losses. With the growing computing power of the Web application client and users' urgent need for real-time information and interaction, many data verification functions have been migrated to the client to reduce the server's performance overhead. Therefore, detecting the input constraints of the Web application client is essential.

Existing research on input constraint detection falls mainly into specification-based and static-analysis-based methods, both of which generate test cases to exercise input constraint functions. The former relies heavily on the quality of development documents, and its test coverage and accuracy are relatively low. The latter takes too long to generate test cases, and its results in real projects are not ideal. Both share a common problem: they must execute the code under test with test cases to discover vulnerabilities, and the test results cannot directly show which type of constraint caused the vulnerability in the constraint verification function. Developers have to re-read their own constraint code in light of the test results and then repeatedly modify and test it, resulting in a long development cycle and low efficiency.

In response to these problems, we aim to help developers understand the constraint verification code they write and find loopholes in the input constraint code promptly while writing it. This paper proposes a constraint code detection method based on the BERT pre-training model. The main work and contributions of this paper are as follows:

(1) Construct an input constraint code data set through a semi-supervised learning method. A data set is the primary prerequisite for studying constraint code detection with natural language processing techniques. Since no public input constraint code data set exists in prior research, this paper first collects the JavaScript portion of the CodeSearchNet code data set published by GitHub. Secondly, to extract the code related to input constraints, the data set must be labelled and classified. However, high-quality data labelling in the code domain is time-consuming and labour-intensive, and the labelling cost is enormous. Traditional semi-supervised learning methods often use only labelled data or only unlabeled data and are prone to over-fitting. Therefore, this paper adopts a semi-supervised text classification method based on MixText, which interpolates labelled and unlabeled code in hidden space, mines the implicit relationships between code samples, and uses the information in the unlabeled code to classify input constraint code while learning from the labelled code. The experimental results show a classification accuracy of 79% with 200 labelled code samples and 50,000 unlabeled code samples. Moreover, an ablation study shows that interpolating at layers {7, 9, 12} of the BERT model gives the highest classification accuracy. This lays the foundation for the constraint code detection research in this paper.

(2) Propose a constraint code entity recognition method that combines CodeBERT and CRF. Because JavaScript syntax is rich and complex and implementation styles are diverse, hand-designed heuristic rules cannot accurately extract the semantic features of input constraint code. To solve this problem, this paper divides constraint code into four parts: constraint function name, constraint variable, constraint condition judgment and constraint feedback behaviour. Because code is also a form of text, this paper applies named entity recognition from natural language processing to constraint code for the first time, labelling the constraint code with constraint entity labels and BIEO labels. Because pre-trained models generalize better and depend less on feature engineering, this paper proposes a method based on the CodeBERT pre-training model and a conditional random field (CRF) to extract constraint code features. The experimental results show that the accuracy, recall and F1 value of the CodeBERT-CRF model are 81.12%, 79.87% and 80.88%, respectively, better than other mainstream models. The higher prediction accuracy of the named entity recognition model demonstrates the feasibility and effectiveness of the method.

(3) Propose a constraint code classification method based on machine learning. To present the semantics of constraint code accurately in natural language, and so help developers understand the code and find loopholes in it promptly, this paper uses machine learning methods to map the features of input constraint code to corresponding semantic descriptions. Since there is currently no research on the types of input constraints, this paper first categorizes the input constraint code commonly used in actual development through card sorting. Secondly, it extracts semantic features, keyword features and information features of the input constraint code through feature design and feature selection. Finally, because input constraint code may belong to multiple constraint types, this paper compares three different types of machine learning multi-label classification methods on constraint code classification. The experimental results show that the ensemble-learning-based extremely randomized trees algorithm achieves an average accuracy of 78%, better than the other two models, while ML-KNN performs better in terms of unique error loss and coverage, at 0.197 and 0.203 respectively.
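
To make the BIEO labelling in contribution (2) concrete, the following is a minimal Python sketch of how a tokenized JavaScript validation function might be tagged with the four constraint entity types. The abstract does not give the actual label names, tokenizer or annotation rules, so everything here is an assumption made for illustration: the tag names FUNC, VAR, COND and FEED, the helper bieo_tags, the sample snippet, and the choice to tag a single-token entity with B- only.

```python
# Illustrative sketch only: label names, tokenization and the sample
# snippet are hypothetical and are not taken from the thesis.
# FUNC = constraint function name, VAR = constraint variable,
# COND = constraint condition judgment, FEED = constraint feedback behaviour.


def bieo_tags(tokens, spans):
    """Return one BIEO tag per token.

    spans is a list of (start, end, entity_type) with end exclusive.
    Tokens outside every span are tagged "O". A single-token entity
    receives only a B- tag (an assumption of this sketch).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"              # first token of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{etype}"              # interior tokens
        if end - start > 1:
            tags[end - 1] = f"E-{etype}"        # last token of the entity
    return tags


# Tokens of: function checkAge(age) { if (age < 18) { alert("Too young"); } }
tokens = [
    "function", "checkAge", "(", "age", ")", "{",
    "if", "(", "age", "<", "18", ")", "{",
    "alert", "(", '"Too young"', ")", ";", "}", "}",
]

# Hand-annotated entity spans (token indices, end exclusive).
spans = [
    (1, 2, "FUNC"),    # checkAge
    (3, 4, "VAR"),     # age (parameter)
    (8, 11, "COND"),   # age < 18
    (13, 17, "FEED"),  # alert ( "Too young" )
]

tags = bieo_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token:14s} {tag}")
```

A sequence labelling model such as the CodeBERT-CRF combination described above would be trained to predict one such tag per token; the CRF layer is typically what discourages invalid sequences such as an E- tag with no preceding B- tag.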
Keywords/Search Tags:Input constraint detection, Pre-trained model, Semi-supervised learning, Named entity recognition, Multi-label classification