| Software projects are not built overnight; they are developed, run and maintained over time. Defects can be introduced at any point in this process, so finding and fixing them is an inevitable part of a software project. Software defect prediction identifies the software modules that are likely to be defective. It helps software testers locate such modules, which facilitates the proper allocation of testing resources and improves software quality. Most current work treats software defect prediction as a binary classification problem, i.e. predicting whether a software module is defect-prone. However, for software projects with a large number of defects and limited testing resources, predicting only whether a module is defective offers limited benefit, and predicting the number of defects in each module would be more conducive to improving software quality. To address these issues, this paper considers the influence of both datasets and learning algorithms on software defect count prediction, and proposes an ensemble learning-based software defect count prediction method. The main work covers the following three aspects. Firstly, in terms of data pre-processing, the data imbalance problem found in most software defect datasets is studied. The dependency relationships between software classes are introduced into the SMOTE algorithm, which is commonly used to deal with this problem, and its method of selecting among the k nearest neighbours when synthesising target-class instances is improved. Experimental results demonstrate the effectiveness of the improved SMOTE algorithm: a software defect count prediction model trained on a dataset processed by it achieves higher prediction accuracy. Secondly, a two-stage feature selection method is proposed to address the many irrelevant and redundant features in the dataset. In the first stage, the density peak clustering method is used to cluster the features in the dataset. In the second stage, the relevance of each feature to the category is calculated and ranked by combining two common feature selection methods, and the top-ranked features in each cluster are then selected, in a number determined by the cluster size, to form a feature subset. In comparisons using three learning algorithms and against the two common feature selection methods, the experimental results show that the two-stage feature selection method improves the performance of the software defect count prediction model. Thirdly, to improve the performance of software defect count prediction models, a new model-building method, ELDCP (Ensemble Learning for Software Defect Count Prediction), is proposed based on the idea of ensemble learning. First, the ensemble learning algorithm AdaBoost.R2 is used to train three learners, each based on one of three classical algorithms; the model fusion algorithm Stacking is then used to combine the three learners into the final software defect count prediction model. The performance of this model was compared with that of single algorithms such as decision tree regression, linear regression and Bayesian ridge regression on the selected dataset, and the experimental results show that the model constructed using ELDCP has better accuracy and stability. |
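
The ELDCP construction described in the abstract (three AdaBoost.R2 learners fused by Stacking) might be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say which three classical algorithms serve as base learners, so decision tree regression, linear regression and Bayesian ridge regression (the single-algorithm baselines it names) are assumed here. The linear meta-learner, the hyperparameters, the helper name `build_eldcp_sketch` and the synthetic data are likewise hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, StackingRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.tree import DecisionTreeRegressor


def build_eldcp_sketch(n_estimators=30, random_state=0):
    """Stack three AdaBoost.R2 ensembles, one per assumed base algorithm."""
    # Assumption: these three regressors stand in for the "three classical
    # algorithms" mentioned in the abstract.
    base_algorithms = {
        "dtr": DecisionTreeRegressor(max_depth=3, random_state=random_state),
        "lr": LinearRegression(),
        "brr": BayesianRidge(),
    }
    # scikit-learn's AdaBoostRegressor implements AdaBoost.R2. The base
    # estimator is passed positionally because its keyword name changed
    # between scikit-learn versions.
    boosted = [
        (name, AdaBoostRegressor(est, n_estimators=n_estimators,
                                 random_state=random_state))
        for name, est in base_algorithms.items()
    ]
    # Assumption: a plain linear regression as the stacking meta-learner.
    return StackingRegressor(estimators=boosted,
                             final_estimator=LinearRegression(), cv=5)


# Synthetic stand-in for a defect dataset: 200 modules, 6 metrics,
# Poisson-distributed defect counts (not real project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.poisson(lam=np.exp(0.4 * X[:, 0] + 0.2 * X[:, 1]))

model = build_eldcp_sketch().fit(X[:150], y[:150])
# Defect counts cannot be negative, so clip the regression output at zero.
predicted_counts = np.clip(model.predict(X[150:]), 0, None)
```

Stacking fits the meta-learner on cross-validated predictions of the three boosted learners, so the final model weighs each learner by how well it generalises, not by its training fit.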