Software defect prediction can effectively support software testing and help guarantee software quality. However, class imbalance causes a model to focus on non-defective modules and to undertrain on defective ones, which greatly reduces classification performance on defective modules. A large number of irrelevant and redundant features can also reduce prediction accuracy. In addition, a single classifier struggles to predict defect data with diverse distributions. The main contributions of this work are as follows. First, to address the class imbalance of software defect data, an ADASYNTomek combined sampling algorithm is proposed. The adaptive synthetic sampling method (ADASYN) focuses on minority-class samples that are difficult to learn, and the Tomek Links method is then applied so that the data set is balanced while noise samples are reduced and data quality is improved. Second, to address the high dimensionality of the data, a deep feature selection algorithm based on comprehensive ranking and recursive feature elimination with cross-validation (CR-RFECV) is proposed. The correlation between features and the class label is analyzed comprehensively using the information gain ratio and the chi-square statistic to eliminate irrelevant features, the Spearman correlation coefficient is used to analyze redundancy between features and remove highly redundant ones, and recursive feature elimination with cross-validation based on ridge regression is then used for a deeper selection. This addresses the poor generalization ability and insufficient stability of any single feature selection method, and improves accuracy while still reducing dimensionality quickly. Moreover, because a model built from a single classifier is not comprehensive enough to predict software defect data with diverse distributions, multiple base classifiers need to be integrated. Therefore, an ATW-Bagging ensemble classification algorithm is proposed. The algorithm addresses
both the training and decision stages. In the training stage, diversity of data distribution is introduced while all samples are still taken into account, and the ADASYNTomek method is used to balance training subsets with different imbalance ratios. In the decision stage, different base classifiers are selected to increase the diversity of the base classifiers, and weighted integration is performed based on misclassification cost. When constructing the software defect prediction model, the data is first preprocessed briefly, the CR-RFECV algorithm is used to reduce its dimensionality, and the ATW-Bagging ensemble classification algorithm is then used to predict each software module and obtain the final predicted class. Finally, the CR-RFECV algorithm is compared with other dimensionality reduction methods, and the ATW-Bagging ensemble classification algorithm is compared with single classification algorithms, the traditional Bagging algorithm, and recent software defect prediction algorithms to verify its effectiveness.
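
To make the noise-cleaning half of the combined sampling step more concrete, the following is a minimal sketch of Tomek-link detection in plain Python. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. The function names, the Euclidean distance, the toy data, and the choice to drop only the majority-class member of each link are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)  # Euclidean distance (assumed metric)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label=0):
    """Drop the majority-class member of every Tomek link.
    Assumption: only the majority side is removed; other policies exist."""
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: label 0 = non-defective, label 1 = defective.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

print(tomek_links(points, labels))  # [(2, 3)]
cleaned_points, cleaned_labels = remove_majority_in_links(points, labels)
print(cleaned_labels)  # [0, 0, 1, 1, 1]
```

In this toy data, samples 2 and 3 sit on the class boundary and form the only Tomek link, so the non-defective sample 2 is removed. In a combined scheme of the kind described above, a cleaning pass like this would follow the oversampling of the minority class.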