Study On Cross-Project Defect Prediction Based On Transfer Learning

Posted on: 2021-06-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L N Gong
Full Text: PDF
GTID: 1488306464959779
Subject: Computer software and theory
Abstract/Summary:
To allocate limited resources to high-risk modules, software defect prediction relies on machine learning algorithms that mine and analyze historical software data to identify those modules. In practice, however, new projects lack historical data, so traditional defect prediction performs poorly on them. With the release of many open-source project datasets, cross-project defect prediction (CPDP) has emerged and become one of the most active research fields in software engineering. Data distributions differ substantially between projects. Transfer learning relaxes the assumption that training data and target data follow the same distribution, and thereby transfers knowledge from the training instances to the target instances. For this reason, CPDP based on transfer learning has attracted widespread attention from researchers. This dissertation surveys the state-of-the-art methods and techniques in CPDP and finds that existing methods still face key problems: class imbalance, differences in the conditional and marginal probability distributions, heterogeneous metrics, and class overlap. To address these problems, it studies CPDP methods based on transfer learning and proposes new, practical techniques that improve the prediction performance of the models. The specific contributions are as follows.

(1) An improved transfer adaptive boosting approach for mixed-project defect prediction. To address the class-imbalance problem in mixed-project defect prediction, we proposed an improved TrAdaBoost (ITrAdaBoost) method that changes the criteria for adjusting the weights of misclassified instances and of weak classifiers. First, we improved the weight adjustment criterion for misclassified instances in TrAdaBoost. The new criterion considers not only whether an instance is misclassified but also the type of the misclassified instance, and it assigns different weights to misclassified defective and non-defective instances. Second, we improved how the weight of each weak classifier is set, using the Matthews correlation coefficient (MCC) instead of accuracy as the measure. Extensive experiments on 18 open-source projects from four datasets showed that ITrAdaBoost not only outperformed other CPDP methods but also matched the performance of class-imbalance prediction models in the within-project setting.

(2) Combining stratification with a nearest-neighbor approach for strict homogeneous defect prediction. To address the different distributions between projects, we proposed a strict homogeneous defect prediction model that iteratively reduces the difference in conditional distributions. First, we obtained pseudo-labels for the software modules by voting over the prediction results of the previous iterations. Then we selected the corresponding nearest-neighbor instances from the source projects according to the pseudo-labels and the number of instances in each class. Finally, we trained the classifier on the selected nearest-neighbor instances and repeated the process. Experimental results showed that, compared with other methods, our method achieved higher AUC and Recall, with comparable pf and F-measure values.

(3) Conditional domain adversarial adaptation for mixed-project heterogeneous defect prediction. To exploit the label information in the source project and a small number of labeled instances in the target project, we proposed the conditional domain adversarial adaptation (CDAA) method, which consists of two processes: transferring knowledge from the source project to the target project, and classification. CDAA has three networks: a generator, a discriminator, and a classifier. The generator maps the source domain into the target domain space and learns the label information during the transfer. The discriminator distinguishes the target data from the generated data. The classifier learns the label information. Extensive experiments showed that CDAA can take full advantage of the label information through a conditional generative adversarial network (CGAN) to transfer knowledge from the source project to the target project, and that it improved the performance of heterogeneous defect prediction.

(4) Unsupervised deep domain adaptation for strict heterogeneous defect prediction. To address the different metric spaces between projects in strict heterogeneous defect prediction, we proposed an unsupervised deep domain adaptation method based on deep transfer learning. In this method, we mapped the source project and the testing project to a unified metric representation (UMR), which served as the input to the deep transfer learning model. During training, the maximum mean discrepancy (MMD) was used to measure the distribution difference between the source and testing projects, and the cross-entropy loss function was used to measure the classification error. Extensive experiments showed that our method can construct an effective prediction model for the heterogeneous defect prediction problem and that it improved prediction performance.

(5) The impact of class overlap on the performance of cross-project defect prediction models. To investigate how strongly class overlap affects the performance of defect prediction models, we used 28 open-source projects as experimental subjects and empirically evaluated whether the neighborhood cleaning method (NCL), the K-Means cluster cleaning approach (KMCCA), and the improved K-Means cluster cleaning approach (IKMCCA) could improve the performance of state-of-the-art learning models. The experimental results showed that, after the overlapping instances were removed, the performance of state-of-the-art learning models improved in terms of balance, Recall, and AUC. Thus, considering the class imbalance and class overlap problems together is more conducive to improving the performance of learning models.
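
As an illustration of the MMD term in contribution (4) (MMD is commonly expanded as maximum mean discrepancy), the following is a minimal sketch of how a distribution gap between source and target projects can be estimated. It is not the dissertation's implementation: the RBF kernel, the `gamma` value, the biased estimator, and the synthetic Gaussian "projects" are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math
import random


def rbf(x, y, gamma):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=0.1):
    """Biased estimate of squared MMD between two samples.

    A larger value indicates a larger gap between the two distributions.
    """
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


def sample(rng, n, dim, mean):
    """Synthetic stand-in for project metric vectors (not real defect data)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
source = sample(rng, 80, 4, 0.0)   # hypothetical source project
similar = sample(rng, 80, 4, 0.0)  # target drawn from the same distribution
shifted = sample(rng, 80, 4, 2.0)  # target whose metrics are shifted

mmd_similar = mmd_squared(source, similar)
mmd_shifted = mmd_squared(source, shifted)
print(f"similar target: {mmd_similar:.4f}")
print(f"shifted target: {mmd_shifted:.4f}")
```

In a deep domain adaptation setting, a quantity of this kind would typically be computed on learned feature representations and added to the cross-entropy classification loss, so that training reduces both the classification error and the source-target gap. How the two terms are weighted is not stated in the abstract.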
Keywords/Search Tags:Software Defect Prediction, Cross-project, Transfer learning, Class-imbalance, Deep learning