Software defect prediction plays an important role in the software life cycle: accurately locating the modules that contain defects helps improve software quality and reduce testing costs. Many researchers have cast defect-proneness prediction as a binary classification problem in machine learning and have proposed a series of effective prediction methods. In practical applications, however, existing methods still suffer from imbalanced data, unclear classification boundaries, and low prediction accuracy, and solving these problems has become a research hotspot in the field. This paper studies the modeling of imbalanced data in software defect prediction at both the data level and the algorithm level. The main work is as follows:

(1) To address the complex distribution of class-imbalanced data, as well as the sample overlap and unclear boundaries that arise after oversampling, an adaptive oversampling algorithm with filtering, F-LDBS (Local density based BIRCH clustering with filtering), is proposed. In the class-imbalance processing stage, LDBS introduces the concept of local sample density: the defective samples are clustered, and new samples are then generated by interpolation according to the density of each subcluster, so that the new defect samples are spread across the space of the defect dataset while adapting to both within-class and between-class imbalance. For data cleaning, CLOR (Closest List Overlapping Data Remove), a closest-list class-overlap cleaning algorithm based on neighborhood search, is proposed. It balances sensitivity and specificity, uses nearest-neighbor search to identify samples in the overlapping region accurately, and alleviates the problem of ambiguous classification boundaries. On AEEEM and NASA, two datasets commonly used in software defect prediction, a decision tree classifier is used to compare several oversampling methods and to verify the effectiveness of the LDBS and F-LDBS oversampling algorithms.

(2) Traditional classification algorithms tend to ignore minority-class samples when predicting on class-imbalanced datasets, which leads to highly biased prediction models. To address this, CatBoost ensemble learning is studied theoretically, grid search is used to find the optimal parameters for the AEEEM and NASA datasets, and the tuned CatBoost ensemble learner is compared experimentally with a range of commonly used machine learning classifiers. The results demonstrate the applicability and efficiency of the CatBoost ensemble learner in software defect prediction.

(3) In the model construction stage, building on the research above, this paper combines F-LDBS oversampling with CatBoost ensemble learning to construct Cat-LDBst, a sampling-plus-ensemble software defect prediction model for imbalanced data, with the aim of mitigating the class imbalance problem as far as possible and improving prediction performance. Comparison with several sampling ensemble algorithms for imbalanced data verifies the superiority of the Cat-LDBst prediction model and the soundness and feasibility of its main ideas.
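
The density-guided interpolation idea behind LDBS in item (1) can be illustrated with a minimal sketch. This is not the algorithm of this paper. It assumes that the defective samples have already been grouped into subclusters (for example by BIRCH), that sparsity is measured as the mean distance to the subcluster centroid, and that sparser subclusters receive proportionally more synthetic samples. The function names `sparsity`, `allocate`, and `oversample` are hypothetical and chosen only for illustration.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sparsity(points):
    """Mean distance to the centroid; larger means a sparser subcluster.

    Assumed stand-in for a local density measure, not the paper's definition.
    """
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def allocate(subclusters, n_new):
    """Split n_new synthetic samples across subclusters by sparsity (assumed rule)."""
    weights = [sparsity(pts) + 1e-9 for pts in subclusters]
    total = sum(weights)
    counts = [int(n_new * w / total) for w in weights]
    # Give the rounding remainder to the sparsest subclusters first.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order[: max(0, n_new - sum(counts))]:
        counts[i] += 1
    return counts


def oversample(subclusters, n_new, seed=0):
    """Create n_new samples by interpolating between pairs in the same subcluster."""
    rng = random.Random(seed)
    synthetic = []
    for pts, k in zip(subclusters, allocate(subclusters, n_new)):
        for _ in range(k):
            a, b = rng.choice(pts), rng.choice(pts)
            t = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Because each new point is interpolated between two members of the same subcluster, it stays inside that subcluster's region rather than bridging distant groups of defective samples. A call such as `oversample(subclusters, 10)` returns ten synthetic feature vectors. A filtering step in the spirit of CLOR would then be applied to the combined data and is not shown here.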