Software defect prediction (SDP) is a key component of software quality assurance. It builds a prediction model from the historical data of software projects to identify potential defects, so that testing resources can be allocated more effectively and the quality of software products improved. Early research focused mainly on the design of software metrics and on within-project defect prediction, but the lack of labeled data often prevents such models from achieving the desired performance. Researchers therefore began to use models trained on historical data from other projects to conduct cross-project defect prediction (CPDP). To handle the CPDP scenario in which the source and target projects have different metric sets, researchers further formulated the cross-project heterogeneous defect prediction (HDP) problem and proposed corresponding solutions. As research has deepened, HDP methods have improved in several respects, but some pressing problems remain:

(1) Existing HDP methods are usually evaluated under different experimental settings, so the conclusions of these studies are often inconsistent. As a result, there is still no unified evaluation of the research progress in HDP.

(2) In terms of how metrics are processed, metric transformation-based HDP methods can better eliminate the heterogeneity between source and target data but ignore the impact of redundant metrics on defect prediction; metric selection-based HDP methods can filter out potentially redundant metrics but inevitably lose some valuable metric information. How to compensate for the shortcomings of these two types of HDP methods has not yet been studied.

(3) Software metrics usually represent specific attributes of modules and therefore have important reference value for practitioners. To improve prediction performance, existing studies usually apply complex transformations to the original metrics and obtain abstract metrics without concrete meaning, which weakens the interpretability of HDP methods. How to enhance interpretability while improving prediction performance is therefore an important research question.

(4) Because project modules differ from one another, they usually affect the defect prediction process differently, which makes HDP with mixed project data vulnerable to the negative impact of noisy modules. In addition, how to fully exploit the latent discriminative information in a large number of unlabeled modules has not been well studied.

This dissertation studies the unsolved problems above and achieves the following research results:

(1) An empirical study of existing HDP methods is conducted. First, a comprehensive literature review of existing HDP research is performed. On this basis, the types of HDP methods are defined and the main improvement directions of current HDP methods are summarized. The prediction performance of existing HDP methods is then compared from different perspectives under a unified experimental setting. Through extensive experiments on 30 projects, we find that: metric transformation-based HDP methods usually achieve better prediction results, while metric selection-based ones have better interpretability; dealing with the class imbalance problem improves prediction performance only to a limited extent, and using mixed project data does not always improve the prediction performance of an HDP method; HDP methods are feasible in CPDP tasks where the source and target projects have the same metric set.

(2) To address the shortcomings of the two existing types of HDP methods, we propose a joint metric selection and transformation (JMST) approach for HDP that exploits the complementarity between the two metric processing methods. JMST employs the maximum mean discrepancy to reduce the distribution difference between the source and target data, and constructs a regularization term based on the l2,1-norm to filter redundant information in the original metrics of the source and target projects. The joint optimization of metric selection and transformation is achieved by an iterative algorithm. Experimental results on 22 projects verify the effectiveness of the proposed approach.

(3) To address the poor interpretability of HDP methods and the class imbalance problem, we propose an aligned metric representation (AMR) based balanced multiset ensemble learning (BMEL) approach. AMR consists of shared, source-specific, and target-specific metrics, and each of its dimensions has a concrete meaning. It is built by learning the translation from shared metrics to specific ones and by reducing the distribution difference between projects. To deal with imbalanced data, we design BMEL, which constructs multiple balanced subsets of the source data and produces an aggregated classifier for predicting the labels of the target data. Experimental results on 22 projects show the effectiveness of both BMEL+AMR and AMR.

(4) To reduce the negative impact of noisy modules on HDP with mixed project data while exploiting the underlying discriminative information in unlabeled modules, we propose a landmark-based domain adaptation and selective pseudo-labeling (LDASP) approach. LDASP highlights the importance of landmarks in defect prediction by assigning appropriate weights to source and target project modules while learning a transformation matrix that efficiently maps source data into the metric subspace of the target data. It reduces both the marginal and the conditional distribution differences between the source and target data, and preserves the intra-class local structure of the source data. We also design a progressive pseudo-label selection strategy that gradually introduces pseudo-labels with higher confidence, together with their corresponding modules, into the learning of the transformation matrix. Experimental results on 27 projects verify the effectiveness of LDASP and of each of its components.
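
As a rough illustration of the two quantities named for JMST in contribution (2), the following minimal Python sketch computes a linear-kernel maximum mean discrepancy between projected source and target data, and the l2,1-norm of a projection matrix. It is not the dissertation's implementation: the function names `projected_mmd` and `l21_norm`, the separate projection matrices `Ws` and `Wt`, the linear kernel, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def projected_mmd(Xs, Ws, Xt, Wt):
    """Squared MMD with a linear kernel (an assumed kernel choice):
    the squared distance between the means of the projected
    source and target data."""
    diff = (Xs @ Ws).mean(axis=0) - (Xt @ Wt).mean(axis=0)
    return float(diff @ diff)

def l21_norm(W):
    """l2,1-norm: the sum of the Euclidean norms of the rows of W.
    A row that shrinks to zero means the matching original metric
    no longer contributes to the projection."""
    return float(np.linalg.norm(W, axis=1).sum())

# Hypothetical toy data: the source has 3 metrics, the target has 2,
# and both are projected into a shared 2-dimensional space.
Xs = np.array([[1.0, 0.0, 2.0],
               [3.0, 0.0, 4.0]])
Xt = np.array([[2.0, 2.0],
               [4.0, 6.0]])
Ws = np.array([[1.0, 0.0],
               [0.0, 0.0],   # zero row: the second source metric is dropped
               [0.0, 1.0]])
Wt = np.eye(2)

mmd_value = projected_mmd(Xs, Ws, Xt, Wt)
ws_penalty = l21_norm(Ws)
print(mmd_value, ws_penalty)
```

In an objective of this general kind, the discrepancy term would pull the projected distributions together while the l2,1 penalty would push whole rows of the projection matrices toward zero; how JMST weights and optimizes these terms is not stated in this abstract.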