| As the economy develops rapidly and software technology advances, software plays an increasingly prominent role in production and daily life, but software defects pose hidden risks to its quality. Predicting defects is therefore important for improving software quality. Defect prediction helps testers identify likely defects in advance, so that they can allocate testing resources appropriately. However, many newly launched software projects lack sufficient data to build prediction models. Cross-project software defect prediction alleviates this problem by building prediction models from labeled data of other projects and applying them to the newly launched project. It faces two major challenges: class imbalance in the datasets and differences in data distribution between projects. Both problems have become active research topics. Researchers have proposed various solutions to them and achieved some success, but shortcomings remain. For example, prediction models still perform poorly because of intra-class imbalance (redundancy or overfitting caused by generating too many synthetic samples in high-density regions), noise in the raw data (incorrect or abnormal samples in the original dataset), feature distortion, ambiguous predictions near decision boundaries, and low pseudo-label accuracy. To address these problems, this thesis studies data preprocessing methods for cross-project software defect prediction. The main work is as follows: (1) An oversampling method based on weighted relative and absolute densities (WRAD) is proposed to address intra-class imbalance and noise in the raw data when processing imbalanced data. First, a filter is defined using relative density and applied to the original data to remove noisy samples 
with relative density less than or equal to 1. Second, the relative and absolute density values are calculated to define the boundary weights and sparsity weights. Then, the boundary weights and sparsity weights are combined in a weighted sum and normalized, so that more synthetic samples fall in sparse regions and near the class boundary, which resolves the intra-class imbalance. Finally, minority-class samples are synthesized according to the normalized weights to resolve the inter-class imbalance. (2) A cross-project software defect prediction method based on manifold combined features and joint distribution matching (MCF-JDM) is proposed to address feature distortion, ambiguous predictions near the decision boundary, and low pseudo-label accuracy when reducing data distribution differences. First, global and local manifold features are extracted in the manifold feature space, preserving their geometric structure to avoid feature distortion. Then, the global and local manifold features are linearly combined into new manifold combined features, which makes the classes easier to distinguish and reduces prediction ambiguity near the decision boundary. Finally, to improve pseudo-label accuracy, an iterative pseudo-label update strategy is introduced: the pseudo-labels are first updated using the manifold combined features and then updated again during joint distribution matching. In addition, WRAD and MCF-JDM are combined to address class imbalance, data distribution differences, and low pseudo-label accuracy simultaneously, which improves prediction performance. Finally, this thesis evaluates WRAD and MCF-JDM on two publicly available datasets. The experiments demonstrate the superiority of the proposed WRAD and MCF-JDM, and show that combining them further improves software defect prediction performance. |
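
The four WRAD steps summarized in (1) can be illustrated with a rough sketch. The abstract does not give the exact definitions of relative density, absolute density, or the two weights, so everything below is an assumption made for illustration: relative density is taken as the mean distance to the k nearest majority-class samples divided by the mean distance to the k nearest minority-class samples, the boundary weight as its inverse, and the sparsity weight as the mean same-class neighbor distance. The function names and the `alpha` mixing parameter are hypothetical, and the final interpolation step is ordinary SMOTE-style interpolation rather than the thesis's own synthesis rule.

```python
import numpy as np


def knn_mean_dist(points, ref, k, exclude_self=False):
    """Mean Euclidean distance from each row of `points` to its k nearest rows of `ref`."""
    d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=2)
    d.sort(axis=1)  # in-place sort of each row, ascending
    start = 1 if exclude_self else 0  # skip the zero self-distance when ref is points
    return d[:, start:start + k].mean(axis=1)


def wrad_like_oversample(X_min, X_maj, n_new, k=3, alpha=0.5, seed=0):
    """Illustrative density-weighted oversampling (assumed definitions, not the thesis's)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12

    # Step 1 (assumed definition): relative density = distance to the other
    # class divided by distance to the own class. Values <= 1 mean the sample
    # sits at least as close to the majority class, so it is treated as noise.
    d_same = knn_mean_dist(X_min, X_min, k, exclude_self=True)
    d_other = knn_mean_dist(X_min, X_maj, k)
    rel = d_other / (d_same + eps)
    keep = rel > 1.0
    X_keep, rel = X_min[keep], rel[keep]

    # Step 2 (assumed definitions): boundary weight grows as relative density
    # approaches 1; sparsity weight grows with same-class neighbor distance.
    boundary_w = 1.0 / rel
    sparsity_w = knn_mean_dist(X_keep, X_keep, k, exclude_self=True) + eps

    # Step 3: weighted sum of the two normalized weights, normalized again.
    w = alpha * boundary_w / boundary_w.sum() + (1 - alpha) * sparsity_w / sparsity_w.sum()
    w /= w.sum()

    # Step 4: pick seed samples in proportion to the weights and interpolate
    # toward a random same-class neighbor (SMOTE-style stand-in).
    seeds = rng.choice(len(X_keep), size=n_new, p=w)
    d = np.linalg.norm(X_keep[seeds][:, None, :] - X_keep[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    partner = nn[np.arange(n_new), rng.integers(0, nn.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_keep[seeds] + gap * (X_keep[partner] - X_keep[seeds])
    return X_keep, synthetic


if __name__ == "__main__":
    # Toy data: one minority sample lies inside the majority cluster.
    X_maj = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0]])
    X_min = np.array([[5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.3, 5.3], [0.05, 0.05]])
    kept, synth = wrad_like_oversample(X_min, X_maj, n_new=6)
    print(kept.shape, synth.shape)
```

In this toy run the minority sample placed inside the majority cluster has a relative density well below 1 and is filtered out before any synthetic samples are generated, which is the behavior the filter step is meant to provide.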