Software defects and vulnerabilities are the root causes of software security problems. Predicting them is an important part of software testing: it helps allocate testing resources sensibly and underpins improvements in software quality and security. To this end, this paper proposes a defect prediction method based on combined sampling and XGBoost, and a vulnerability prediction method based on combined N-gram feature extraction and a heterogeneous ensemble algorithm. The main contents are as follows. First, the state of research on software defect and vulnerability prediction is reviewed, with an in-depth analysis of static prediction methods based on machine learning; the problems of quantitatively characterizing source code features, high feature dimensionality, and class imbalance in datasets are then studied together with their solutions. Second, to address the high feature dimensionality of defect datasets built on structured metrics, a feature selection algorithm combining recursive feature elimination and ridge regression is proposed, using a heuristic search strategy for feature analysis. The ADA-RENN combined sampling algorithm is used to balance the distribution of data samples, and the processed data are then used to build a prediction model based on XGBoost. Third, vulnerabilities are a special subset of defects that can seriously threaten the security of software systems, yet methods based on software metrics cannot accurately identify the characteristics of the vulnerabilities contained in code. This paper therefore proposes a vulnerability prediction method at the code segment level. After an in-depth study of code semantic features, a combined N-gram feature extraction algorithm is proposed, which merges feature information of different granularities and different window sizes and uses the TF-IDF algorithm to construct a vector space model, mapping source text to a real-valued matrix. Furthermore, by exploiting the complementary strengths and structural differences of different classifiers, a heterogeneous ensemble classifier based on the Stacking strategy is constructed to improve the accuracy and generalization ability of the model. Finally, experiments are conducted on the MDP defect dataset of C-language programs using the defect prediction method based on combined sampling and XGBoost, and on the Code Gadget dataset using the vulnerability prediction method based on combined N-gram feature extraction and the heterogeneous ensemble algorithm; the results verify the effectiveness of both methods.
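To make the idea of combined N-gram feature extraction with TF-IDF more concrete, the following is a minimal sketch in plain Python, not the paper's implementation. Everything specific in it is an assumption made for illustration: the regex tokenizer, the reading of "different granularities" as token-level versus character-level N-grams, the window sizes, the smoothed IDF formula, and the function names `tokenize`, `combined_ngrams`, and `tfidf_matrix`.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical tokenizer: identifiers, integers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(seq, n):
    # All contiguous windows of length n over a sequence.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def combined_ngrams(code, token_sizes=(1, 2, 3), char_sizes=(3,)):
    # Merge N-grams from two granularities (tokens and characters) and
    # several window sizes into one bag of features. The sizes are assumed.
    feats = Counter()
    tokens = tokenize(code)
    for n in token_sizes:
        feats.update(("tok", n, g) for g in ngrams(tokens, n))
    chars = list(code)
    for n in char_sizes:
        feats.update(("chr", n, g) for g in ngrams(chars, n))
    return feats


def tfidf_matrix(snippets):
    # Build a vector space model: one row per snippet, one column per feature.
    bags = [combined_ngrams(s) for s in snippets]
    vocab = sorted({f for bag in bags for f in bag})
    n_docs = len(bags)
    # Document frequency: number of snippets in which each feature occurs.
    df = Counter(f for bag in bags for f in bag)
    # Smoothed IDF (an assumed variant; others are equally valid).
    idf = {f: math.log((1 + n_docs) / (1 + df[f])) + 1.0 for f in vocab}
    matrix = []
    for bag in bags:
        total = sum(bag.values())
        matrix.append([bag[f] / total * idf[f] if total else 0.0 for f in vocab])
    return vocab, matrix


# Two toy code segments standing in for entries of a vulnerability dataset.
snippets = ["strcpy(buf, src);", "strncpy(buf, src, n);"]
vocab, X = tfidf_matrix(snippets)
print(len(X), "rows,", len(vocab), "columns")
```

Each row of `X` is the real-valued vector for one code segment, which is the form a downstream classifier, such as a Stacking ensemble, would consume. Tagging every feature with its granularity and window size keeps N-grams from different sources in separate columns even when their text coincides.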