Research On Machine Learning Based Software Defect Prediction

Posted on: 2013-01-25  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Ma  Full Text: PDF
GTID: 1118330374486908  Subject: Computer system architecture
Abstract/Summary:
Developing a good and stable software defect predictor has become a domestic and international research frontier, and attracts more and more attention from the software industry. We analyze the software defect prediction problem from a machine learning perspective, where software characteristics are represented by static code features and defect predictors are learned from historical defect logs. We observe that existing defect predictors have reached a performance ceiling, due to three main problems in application:

(1) Data distributions differ among data sets that come from different projects or different domains. Therefore, a predictor built with traditional methods adapts poorly.

(2) Manually labeling defective modules is both costly and time consuming, so positive (defective) examples are limited. Recent approaches are based on supervised learning algorithms, which learn predictors from labeled data only. These methods cannot satisfy the requirements, since the limited labeled data are not enough to learn predictors.

(3) Software defect data are always class imbalanced: the number of positive (defective) examples is much lower than that of the others. The class imbalance problem greatly degrades the performance of the defect predictor.

In the thesis, we survey the state of the art in software defect prediction research, including motivations, progress, characteristics, and disadvantages. This thesis presents innovative and practical techniques for addressing the three problems mentioned above in software defect prediction, as follows:

(1) Research on predictive model on transfer learning method
Unlike prior works, which select training data that are similar to the test data, we propose a novel algorithm called Transfer Naive Bayes (TNB), which uses the information of all the proper features in the training data. Our solution estimates the distribution of the test data and transfers cross-company data information into the weights of the training data.
On these weighted data, the defect prediction model is built. We also present a theoretical analysis of the comparative methods and show experimental results on data sets from different organizations. The results indicate that TNB is more accurate in terms of AUC, with less runtime, than the state-of-the-art methods.

(2) Research on predictive model on semi-supervised learning method
We present an improved semi-supervised learning approach for defect prediction that addresses both the class imbalance and the limited labeled data problems. This approach employs random undersampling to resample the original training set and updates the training set in each round of a co-training style algorithm. By combating these problems, it makes the defect predictor more practical for real applications. In comparison with conventional machine learning approaches, our method has significantly superior performance. Experimental results also show that, with the proposed learning approach, it is possible to design better methods to tackle the class imbalance problem in semi-supervised learning.

(3) Active learning for software defect prediction
We introduce active learning strategies into defect prediction. An active learning method, called the Two-stage Active Learning algorithm (TAL), is developed for software defect prediction. Combining clustering and support vector machine techniques, this method improves the performance of the predictor with less labeling effort. The experiments validate its effectiveness.

(4) Kernel based asymmetric learning for software defect prediction
A kernel based asymmetric learning method is developed for software defect prediction. Kernel methods can deal effectively with nonlinearly separable classification problems. We also analyse the effect of the class imbalance problem on kernel principal component analysis.
The proposed method, Asymmetric Kernel Principal Component Classification (AKPCC), improves the performance of the predictor on class imbalanced data, since it recovers the loss caused by the class imbalance problem, based on kernel principal component analysis. This method achieves a better F-measure than other well-known methods.
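
As an illustration of the random undersampling step mentioned under contribution (2), a minimal Python sketch follows. It is not the thesis's algorithm: the function name `random_undersample`, the 0/1 label encoding (1 = defective), and the toy data are assumptions made for this example, and the co-training loop itself is not shown.

```python
import random
from typing import List, Sequence, Tuple


def random_undersample(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Balance a binary data set by randomly discarding majority-class rows.

    Assumed encoding: label 1 marks a defective module, label 0 a clean one.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class has fewer rows is kept whole.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Draw, without replacement, as many majority rows as there are minority rows.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [features[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 2 defective modules out of 10.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
X_bal, y_bal = random_undersample(X, y, seed=42)
print(sum(y_bal), len(y_bal))  # prints: 2 4
```

In a co-training style procedure such as the one the abstract describes, a resampling step of this kind would presumably be repeated each round as newly labeled modules are added to the training set; how the thesis schedules it is not stated here.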
Keywords/Search Tags: Software Defect Prediction, Machine Learning, Transfer Learning, Active Learning, Semi-supervised Learning