
A Study On Software Defects Prediction Based On Machine Learning

Posted on: 2017-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 1318330503982904
Subject: Computer Science and Technology
Abstract/Summary:
Software testing is an indispensable stage in building high-quality, reliable software. However, as the size and complexity of software keep growing, the cost of detecting and fixing hidden defects is also rising rapidly. Software defect prediction helps testers locate defect-prone modules early, so that decision makers can give those modules priority when allocating scarce testing resources, which improves software quality and saves considerable time and budget.

This thesis studies methods of software defect prediction based on machine learning. Although previous studies have proposed many defect prediction models using various machine learning algorithms, several practical issues remain when using the current models:

(1) Conventional prediction models cannot be trained effectively in the early stages of software testing because sufficient training data are lacking. Labeling defect samples is also time-consuming and costly.

(2) Defect datasets are class-imbalanced, which can bias the prediction model toward non-defective instances.
Moreover, class overlap also exists in imbalanced datasets; it leads the classifier to ignore the minority defective instances and thus amplifies the impact of class imbalance.

(3) For some extremely imbalanced datasets, defective modules are so rare that there are too few defective training samples to construct an effective prediction model.

Motivated by the analysis of the problems above, this thesis proposes several novel software defect prediction models and reaches the following conclusions:

(1) To tackle the lack of labeled training samples early in software testing, a cross-company defect prediction (CCDP) model based on transfer boosting is proposed, which uses labeled defect data from other companies. First, the cross-company data are re-weighted with the data gravity method according to how similar their attributes are to those of the within-company samples. Then the prediction model is built with transfer boosting, which uses a small proportion of labeled within-company data to eliminate the cross-company instances that conflict with CCDP. The experimental results and statistical analyses show that the proposed method achieves the best overall performance among all tested CCDP models. Compared with traditional within-company defect prediction (WCDP), the proposed model performs significantly better than WCDP models trained on limited samples and is comparable to WCDP models trained on sufficient data.

(2) To handle the class imbalance that is widespread in defect datasets, a novel defect prediction model is proposed based on class overlap and imbalance learning.
First, to address class overlap in imbalanced datasets, overlapping non-defective samples are removed using the proposed neighbor cleaning rules, so that the classifier can learn the defective samples more effectively. Then, to address the imbalance in sample counts between classes, the imbalanced dataset is divided by random under-sampling into several small, balanced subsets. A sub-classifier is trained on each balanced subset, and the final classifier is produced by ensemble learning. Comparative experiments are conducted on 9 imbalanced datasets selected from a public software repository. The results indicate that, compared with conventional defect prediction methods and other class imbalance learning algorithms, the proposed model achieves a higher defect prediction rate and the best overall prediction performance.

(3) To address extremely imbalanced datasets in which defective samples are too rare to train a classifier, a one-class defect prediction model based on one-class SVM is first proposed, which uses only non-defective samples. The empirical study is conducted on 6 extremely imbalanced datasets. The results suggest that the one-class model achieves a relatively high defect prediction rate and overall performance using only a small proportion of the non-defective samples, and that it outperforms the tested conventional defect prediction models and class imbalance learning models on most of the test datasets.
Second, a dynamic selection ensemble strategy is proposed to improve the one-class SVM. The experimental results suggest that it effectively increases the defect prediction rate of the one-class model while keeping overall performance comparable, which further improves the usefulness of the one-class model in practical defect prediction.

This thesis offers a new perspective on software defect prediction, improves the effectiveness and applicability of defect prediction models across different software defect datasets, and provides viable solutions for software engineers applying defect prediction in practical software testing.
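
The balanced-subset ensemble described in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: the neighbor cleaning step is omitted, the nearest-centroid base learner and majority voting are stand-ins chosen for brevity, and all function names, parameters, and toy data below are hypothetical.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroid_classifier(defective, clean):
    """Stand-in base learner (an assumption, not the thesis's choice):
    label a module 1 (defective) if it is at least as close to the
    defective centroid as to the non-defective centroid, else 0."""
    c_def, c_clean = centroid(defective), centroid(clean)
    return lambda x: 1 if sq_dist(x, c_def) <= sq_dist(x, c_clean) else 0


def train_balanced_ensemble(defective, clean, n_subsets=5, seed=0):
    """Draw n_subsets balanced subsets by random under-sampling and
    train one sub-classifier on each of them."""
    rng = random.Random(seed)
    k = min(len(defective), len(clean))  # size of the smaller class
    models = []
    for _ in range(n_subsets):
        # Sample k rows from each class so every subset is balanced.
        def_subset = rng.sample(defective, k)
        clean_subset = rng.sample(clean, k)
        models.append(train_centroid_classifier(def_subset, clean_subset))
    return models


def predict(models, x):
    """Combine the sub-classifiers by majority vote (ties count as defective)."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes >= len(models) else 0


# Toy data: 3 defective modules versus 8 non-defective ones (made up).
defective = [[9.0, 8.0], [10.0, 9.5], [8.5, 9.0]]
clean = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [0.5, 1.0],
         [2.5, 2.0], [1.0, 2.5], [2.0, 2.0], [3.0, 1.0]]

models = train_balanced_ensemble(defective, clean, n_subsets=5, seed=0)
print(predict(models, [9.0, 9.0]))  # 1
print(predict(models, [1.0, 1.5]))  # 0
```

Because each sub-classifier sees equally many defective and non-defective samples, no single model is dominated by the majority class, while different random subsets let the ensemble still draw on most of the non-defective data.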
Keywords/Search Tags: software defects prediction, cross-company defects prediction, class imbalance defects prediction, machine learning