Research On Approaches For Software Defect Prediction By Machine Learning

Posted on: 2021-01-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Kamal Bashir Elsiddig Abdelgad
GTID: 1488306473472244
Subject: Computer Science and Technology

Abstract/Summary:
Software testing resources are usually limited, yet testing requires assessing a large number of software modules, which is time-consuming and expensive. Moreover, traditional testing approaches are not sufficient for quality assurance because of the limited opportunities for testing during software development. Automated Software Defect Prediction (SDP) at an early development stage is therefore now widely used to prioritize and optimize tests, make better use of resources, and improve software quality. Machine Learning (ML) has been employed for this purpose. However, ML algorithms require good data quality for accurate SDP, and real-world datasets usually contain impurities. In SDP, a model is built from labeled instances and then used to predict the class of new, previously unobserved instances. If the dataset used to train this model is corrupted, both the learning phase and the resulting model are negatively affected, and the final model will probably be less accurate. One way to improve quality is to cleanse the defect dataset by exploring the possible problems and attempting to correct them. A detailed survey of the existing SDP literature suggests that the most commonly studied problem in this domain is classification. In such situations, additional support plays a vital role in resolving data quality challenges. In particular, the presence of less important software metrics, class imbalance, class noise, and uninformative instances in defect datasets creates growing challenges in SDP that compromise classification accuracy. These challenges ultimately result in poor inference and inefficient resource management. Determining data quality and the relevant software metrics with which software defects can be correctly predicted is therefore still an open problem. This research aims to develop novel approaches that resolve these major challenges and enhance SDP.

In this dissertation, the initial phase proposes a novel Feature Selection (FS) approach based on Maximum Likelihood Logistic Regression (MLLR). The method preprocesses the data with noise filtering and imbalance treatment before applying the MLLR, in order to select more relevant features. It consists of the following three steps: 1) Iterative Partitioning Filtering (IPF) is applied over multiple iterations to remove class noise from the defect datasets; 2) the Synthetic Minority Over-sampling Technique (SMOTE) is used to address class imbalance, generating new synthetic Defect-Prone (DP) instances by random linear interpolation between a DP instance and its nearest neighbor; 3) the MLLR approach is applied to identify important features through the Wald test of statistical significance on the coefficient estimated for each feature at a specified confidence level. The proposed approach is compared with well-known FS methods such as Chi-Square, Information Gain, Gain Ratio, Relief, and Symmetric Uncertainty (SU) in a case study of six selected software defect datasets. Based on the SDP results of three different classifiers, the experimental findings suggest that the proposed approach is more effective than the methods it is compared with.

The second phase proposes a novel over-sampling technique called SMOTE-FRNF. This method addresses class imbalance and the presence of noisy and borderline samples in software defect data. It integrates Fuzzy-Rough Instance Selection (FRIS), the Iterative Noise Filter based on the Fusion of Classifiers (INFFC), and SMOTE over-sampling to deal with data imbalance as well as noisy samples. The algorithm starts by applying SMOTE to generate synthetic examples by linear interpolation between randomly selected DP instances and their k nearest neighbors (k-NN). FRIS is then employed to remove synthetic minority instances, as well as original majority instances, that have a small membership degree to the fuzzy positive region. Finally, INFFC is applied to clean the entire dataset. The application potential of the proposed method for SDP is tested on real-world datasets as well as on artificial ones created by introducing different levels of noise into the real-world data. Judged by the SDP performance of different classifiers built on datasets preprocessed by the various methods, the proposal is superior to the compared methods on all performance indicators. The Wilcoxon signed-rank test confirms the statistical significance of these findings.

The last phase proposes a novel integrated preprocessing framework in which different FS, Data Balancing (DB), and Noise Filtering (NF) techniques are combined to deal with the factors that degrade learning performance in SDP studies. The scheme first applies an FS technique to handle feature redundancy and irrelevance. The samples over the selected feature set are then balanced using several data balancing techniques to augment the minority class. Finally, the balanced data is filtered for noise using different NF methods. The empirical findings, captured in several performance indicators, suggest that the proposed method is suitable for enhancing model performance and more effective at doing so than the compared methods from the literature.
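
The SMOTE step mentioned in the abstract (random linear interpolation between a DP instance and one of its nearest DP neighbors) can be sketched as follows. This is a minimal illustration in plain Python, not the dissertation's implementation: the function name `smote_like_oversample`, its parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for the example, and the noise filtering (IPF, FRIS, INFFC) is not shown.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    (defect-prone) instance and one of its k nearest minority neighbours.

    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a base minority instance at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority instances by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolation factor in [0, 1): 0 reproduces the base instance.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy defect-prone instances with two made-up software metrics each.
defect_prone = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(defect_prone, n_new=6, k=2)
```

Each synthetic point lies on the line segment between two existing DP instances, so the minority class grows without leaving the region it already occupies. That is also why noisy or borderline DP instances can spread their influence, which motivates the filtering steps that accompany SMOTE in the proposed methods.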
Keywords/Search Tags: Software Defect Prediction, Machine Learning, Feature Selection, Data Balancing, Noise Filtering, Maximum-Likelihood Logistic Regression