Software defect prediction (SDP) aims to identify defect-prone modules by building a prediction model trained on empirical software defect data. Applying the model to predict whether a software module is likely to contain faults helps the quality assurance team allocate limited testing resources reasonably. However, obtaining and labeling historical data is time-consuming and laborious, and historical data is often insufficient, especially in the early stages of the software life cycle. Cross-project defect prediction (CPDP) was introduced to address this problem. CPDP leverages knowledge from external, richly labeled open-source software projects (source projects) to support tasks in the current label-poor software project (target project), thereby improving the model's defect prediction ability on the current project. Although CPDP techniques have recently received much attention, several challenges remain: (1) how to simultaneously improve the transferability between software projects and the discriminability between categories; (2) when probability distributions are matched, the pseudo-labels of the target project deviate from the real labels, which degrades predictive performance; (3) when multiple source projects are available, how to measure the amount of knowledge transferred from each source project to the target task; and (4) how to avoid the negative transfer caused by transferring irrelevant knowledge. To address these four challenges and further improve the performance of defect prediction models, this thesis proposes four CPDP methods based on domain adaptation. The specific research contents are as follows:

(1) Previous research concentrated only on reducing the distribution difference between the source and target projects to achieve transferability between them, and placed inadequate emphasis on the discriminability between the defective and non-defective categories. To learn
a feature representation with both transferability and discriminability, this thesis presents a discriminative domain adaptation method. From the perspective of probability distribution, joint distribution matching reduces the marginal and conditional distribution differences between the two projects, which achieves transferability between projects. From the perspective of instance relationships, graph embedding is incorporated into the adaptation process so that the adapted feature space meets the criterion of "compact within a class, separate between classes".

(2) Existing CPDP methods mainly learn a global feature representation while ignoring the local feature representation between instances of the same category from different projects. Without this fine-grained information, transfer learning performance is unsatisfactory. In addition, CPDP methods based on pseudo-labels assume that the conditional distribution can be matched in one stroke once the instances of the target project have been annotated with correct pseudo-labels. However, because of the large gap between projects, the pseudo-labels deviate seriously from the real labels. Therefore, this thesis proposes a novel joint feature representation with double marginalized denoising autoencoders (DMDA-JFR). We use two novel autoencoders to jointly learn the global and local feature representations. To achieve progressive distribution matching, we introduce a repetitious pseudo-labeling strategy, so that distributions are matched after each stacked layer is learned rather than in one stroke.

(3) Previous studies have shown that knowledge transferred from different source projects affects the target task differently. Therefore, one of the fundamental challenges in CPDP is how to measure the amount of knowledge transferred from each source project to the target task. This thesis proposes a novel CPDP method called
Multi-source defect prediction with Joint Wasserstein Distance and Ensemble Learning (MJWDEL), which learns transfer weights that evaluate the importance of each source project to the target task. In particular, we apply the transfer component analysis (TCA) technique to train a sub-model for each source project paired with the target project. Meanwhile, we design a joint Wasserstein distance to characterize the source-target relationship and use it as the basis for computing the transfer weights of the sub-models. The transfer weights are then used to reweight the sub-models according to their importance in knowledge transfer to the target task, and a final ensemble model is formed.

(4) How can the negative transfer caused by irrelevant knowledge be avoided? Most existing CPDP methods transfer knowledge at either the feature level or the instance level, which in essence weakens the impact of irrelevant knowledge. To further strengthen knowledge transfer across projects, this thesis proposes a dual weighting mechanism that weights data at both the feature level and the instance level. When weights are assigned to the features of the source project, features that are highly correlated with the learning task, uncorrelated with other features, and minimize the difference from the target project receive higher feature weights. When weights are assigned to source project instances, a local data gravity that accounts for the density of the data distribution is proposed, and the instances are reweighted based on it. Finally, the feature and instance weights are embedded into a Bayesian classifier to form the defect prediction model.
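
To make the weighting idea in contribution (3) concrete, the following is a minimal sketch rather than the MJWDEL method itself. The abstract does not define the joint Wasserstein distance or the weight formula, so the sketch assumes a plain one-dimensional Wasserstein-1 distance averaged over features, a softmax over negative distances as the transfer weights, and sub-models that output a defect probability. All function names and the `temperature` parameter are hypothetical.

```python
import math


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    dist, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance both empirical CDFs up to and including `left`.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def project_distance(source, target):
    """Assumed stand-in for the joint distance: mean per-feature distance.

    `source` and `target` are lists of rows (modules) of numeric metrics.
    """
    n_features = len(target[0])
    total = 0.0
    for k in range(n_features):
        total += wasserstein_1d([row[k] for row in source],
                                [row[k] for row in target])
    return total / n_features


def transfer_weights(sources, target, temperature=1.0):
    """Softmax over negative distances: closer sources get larger weights."""
    scores = [math.exp(-project_distance(s, target) / temperature)
              for s in sources]
    total = sum(scores)
    return [s / total for s in scores]


def ensemble_predict(sub_model_probs, weights, threshold=0.5):
    """Weighted average of the sub-models' defect probabilities for one module."""
    score = sum(w * p for w, p in zip(weights, sub_model_probs))
    return score, score >= threshold


# Toy data: two software metrics per module (values are illustrative only).
target = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
near_source = [[1.1, 10.5], [2.1, 12.5], [3.1, 14.5]]
far_source = [[8.0, 40.0], [9.0, 42.0], [10.0, 44.0]]

weights = transfer_weights([near_source, far_source], target)
print(weights)
print(ensemble_predict([0.9, 0.1], weights))
```

In this sketch the source project whose metric distributions lie closer to the target receives almost all of the weight, so its sub-model dominates the ensemble decision. A real implementation would likely need feature normalization before computing distances, since unscaled metrics with large ranges would otherwise dominate the average.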