A LightGBM Software Defect Prediction Model Based on Feature Selection and Mixed Sampling

Posted on: 2022-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: S J Shu
Full Text: PDF
GTID: 2518306746468804
Subject: Computer Science and Technology
Abstract/Summary:
Software defect prediction is a hot topic in the field of software quality. It predicts software defects from the static code characteristics and defect records of historically submitted code, so that defects can be avoided. At present, many researchers predict software defects with machine learning. However, traditional software defect prediction methods often perform poorly, because the task typically suffers from four problems: the features of software defect data sets are measured on different scales and many of them are redundant; the class distribution of the data sets is imbalanced; the cross-entropy loss function used by machine learning algorithms tends to neglect hard-to-classify samples; and manually chosen hyperparameters rarely give the best performance. To solve these problems, this paper proposes a LightGBM software defect prediction model based on feature selection and mixed sampling, which addresses the four problems in turn. The main work of this paper is as follows:

(1) For the differing feature scales and redundant features, this paper standardizes the original data set with a combination of the standard scaler and the robust scaler, scaling the software defect data set and mitigating outliers so that differing feature scales no longer affect the model. A combined feature selection method based on random forest and RFECV then selects 45 SQ violation features from 166 as the optimal feature subset.

(2) For the imbalanced class distribution, this paper balances the software defect data set with a mixed sampling algorithm based on SMOTE-ENN. SMOTE oversamples the minority class, and ENN cleans overlapping samples. The ratio of majority-class to minority-class samples in the data set is reduced from 33.18:1 to 1.16:1.

(3) Because the cross-entropy loss function tends to neglect hard-to-classify samples, this paper uses the focal loss function and adjusts its (α, γ) parameters to change how much attention the LightGBM algorithm pays to easy and hard samples.

(4) Because the FL-LightGBM algorithm does not reach its best prediction performance with default hyperparameters, this paper uses a Bayesian optimization algorithm to automatically tune 10 main hyperparameters of FL-LightGBM. The resulting Bayes-FL-LightGBM algorithm improves on FL-LightGBM across the evaluation metrics.
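The idea in item (3) can be illustrated with a minimal sketch of the binary focal loss, using only the Python standard library. The function names and the values of the two focal loss parameters (conventionally written α and γ, here `alpha=0.25` and `gamma=2.0`) are assumptions chosen for illustration, not the settings tuned in the thesis. The sketch also omits the gradient and Hessian that a custom LightGBM objective would need.

```python
import math


def cross_entropy(p, y):
    """Cross-entropy loss for one sample.

    p: predicted probability of the positive (defective) class, 0 < p < 1
    y: true label, 1 = defective, 0 = clean
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample (illustrative parameter values).

    alpha weights the positive class against the negative class;
    gamma controls how strongly well-classified samples are down-weighted.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare an easy and a hard defective sample under both losses.
for name, p in [("easy", 0.95), ("hard", 0.30)]:
    ce = cross_entropy(p, 1)
    fl = focal_loss(p, 1)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.6f}, ratio={fl / ce:.6f}")
```

The ratio of focal loss to cross-entropy is `alpha_t * (1 - p_t) ** gamma`, so a confidently correct sample contributes far less to the total loss than a poorly classified one. With `gamma=0` the focal loss reduces to a class-weighted cross-entropy.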
Keywords/Search Tags: Software defect prediction, RFECV, Data equalization, Focal Loss, LightGBM