
Software Fault Prediction Based On Machine Learning Approaches

Posted on: 2020-03-24 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Chubato Wondaferaw Yohannese | Full Text: PDF
GTID: 1368330599475540 | Subject: Computer Science and Technology
Abstract/Summary:
Software testing resources are mostly limited, yet testing requires assessing a large number of software modules, which is time-consuming and expensive. Moreover, traditional testing approaches are not sufficient for quality assurance because of the limited opportunities available during software development. An automated technique for Software Fault Prediction (SFP) at the early development stage has therefore become essential to prioritize and optimize tests, make better use of resources, and improve software quality. In this regard, Machine Learning (ML) has been employed. However, ML algorithms require good data quality for accurate SFP, and real-world datasets usually contain impurities. In SFP, a model is built from labeled instances and then used to predict the class of new, previously unobserved instances. If the dataset used to train this model is corrupted, both the learning phase and the resulting model are negatively affected, and the final model will probably be less accurate. A solution is therefore to cleanse the defect dataset by identifying the underlying problems and correcting them. A detailed survey of the existing scientific literature in SFP suggests that the most commonly studied problem in this domain is classification, and in this setting additional support plays a vital role in resolving data quality challenges. In particular, the presence of less important software metrics, class imbalance, class noise, non-useful instances, and outliers in defect datasets creates growing challenges in SFP that compromise classification accuracy, resulting in poor inference and inefficient resource management. Determining data quality and the significant software metrics from which software faults can be predicted therefore remains an open problem. This research work aims to develop novel combined approaches that resolve these major challenges and enhance SFP.

The initial phase of this dissertation proposes a novel combined-learning based framework. The method independently examines software metrics, resolving feature redundancy and irrelevancy with multiple Feature Selection (FS) techniques and, in combination, addressing the class imbalance problem through Data Balancing (DB) with the Synthetic Minority Over-sampling Technique (SMOTE). Accordingly, a new framework that efficiently manages these challenges in a combined form on both Object-Oriented Metrics (OOM) and Static Code Metrics (SCM) is developed. In the experiments, the Naive Bayes (NB), Neural Network (NN), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (k-NN), Decision Table (DTa), Decision Tree (DTr), and Random Tree (RTr) ML algorithms are employed, and the Receiver Operating Characteristic (ROC) curve is used as the performance evaluation metric. The experimental results confirm that prediction performance can be compromised without a suitable FS technique and that, to obtain accurate and unbiased predictions, the data must be balanced; the proposed technique therefore assures better performance. The combination of RF with Information Gain (IG) FS yields the highest ROC result and is the best combination when SCM is used, whereas the combination of RF with Correlation-based Feature Selection (CFS) yields the highest ROC result and is the best choice when OOM is used.
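The first-phase pipeline, combining feature selection with SMOTE-based data balancing, could be sketched roughly as follows. The dissertation does not name an implementation, so this minimal Python sketch assumes scikit-learn and imbalanced-learn, uses mutual information as the information-gain estimator, RF as the learner, and an illustrative hold-out split; the feature matrix, labels, and parameter values are placeholders rather than the author's actual setup.

```python
# Minimal sketch of the phase-1 pipeline: information-gain feature selection
# combined with SMOTE data balancing, evaluated with a Random Forest and ROC AUC.
# Assumes a NumPy feature matrix X and binary fault labels y are already loaded
# from a defect dataset (e.g. static code metrics); library choices are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

def phase_one_sfp(X, y, k_features=10, random_state=42):
    # Hold out a test set before any resampling so evaluation stays unbiased.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)

    # Feature selection: keep the k metrics with the highest information gain
    # (mutual information is used here as the information-gain estimator).
    selector = SelectKBest(mutual_info_classif, k=min(k_features, X.shape[1]))
    X_train_fs = selector.fit_transform(X_train, y_train)
    X_test_fs = selector.transform(X_test)

    # Data balancing: oversample the faulty (minority) class with SMOTE.
    X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_train_fs, y_train)

    # Train the classifier and report the area under the ROC curve.
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_bal, y_bal)
    scores = clf.predict_proba(X_test_fs)[:, 1]
    return roc_auc_score(y_test, scores)
```

Swapping the selector (e.g. for a CFS-style subset selector) or the learner for any of the other algorithms listed above would reproduce the other FS-classifier combinations compared in the experiments.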
The second phase further proposes a more comprehensive SFP framework: a three-stage ensemble-learning approach. This research addresses the major SFP challenges created by the presence of class noise and is further integrated with handling a large number of features and data skewness. In stage one, the top features with the highest IG are selected, based on computing the difference between the average amount of information required (the entropy) and the expected information required to classify an instance when partitioning on each feature. In stage two, focusing on Faulty Prone (FP) instances, the class distribution is handled by linear interpolation between randomly selected FP k-nearest neighbors to produce new FP instances. In stage three, Noise Filtering (NF) is performed based on the fusion of the predictions of classifiers used to detect the presence of noise, following an iterative NF schema; C4.5, 3-NN, and Logistic Regression are used for the fusion-based NF. Accordingly, a large-scale comprehensive experiment is conducted. The performance of each stage is evaluated using 13 eminent Ensemble Learning Algorithms (ELA), and a thorough, statistically sound comparison in each stage is presented using one-way ANOVA and Tukey's HSD test. The experimental results confirm that the proposed technique outperforms the alternatives; particularly high results are achieved when the ELA are applied to the important features of well-distributed data after noisy instances have been removed. Moreover, noise filtering greatly decreases prediction errors and reduces computational cost in most cases, measured using Root Mean Squared Error and elapsed training time, respectively.

The last phase proposes a novel hybrid data reduction approach for improving SFP. This research addresses three major SFP challenges created by outliers, superfluous instances, and superfluous features. Defect datasets usually contain non-useful instances, which can be noisy or redundant and hinder classification performance. Instance selection is therefore carried out based on an estimate of the sample probability distribution in the neighborhood: the probability that a given sample belongs to a class is the weighted average of the probabilities that the sample's k-nearest neighbors belong to the same class. For outlier analysis, every feature in the dataset is evaluated by measuring how far a data point lies from the mean of its distribution, and an outlier score is computed for each instance in each feature; if the score is greater than or equal to one, the respective instance is considered an outlier and discarded from the dataset. To select important features, the average of all feature-classification correlations and all feature-feature correlations is considered; to achieve this, CFS is used in conjunction with best-first and evolutionary search methods. Accordingly, hybrid data reduction approaches are developed, and their performance is evaluated using the Bagging, RF, DTr, NB, and DTa ML algorithms. Thorough and statistically sound comparisons are presented. The experimental results show the excellent performance of the hybrid approach, which removes outliers first, then handles important features, and finally selects useful instances. By dealing with the challenges mentioned above, the proposed approach ensures enhanced SFP performance and lays the pathway to quality assurance.
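The stage-three fusion-based noise filter could be sketched as follows, purely as an illustration of the idea described above: each detector's cross-validated predictions are fused, instances whose recorded labels disagree with the fused vote are treated as class noise, and the filter is repeated until it converges. The majority-vote rule, the iteration cap, and the use of scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 are assumptions, not the dissertation's exact procedure.

```python
# Sketch of the stage-three fusion-based noise filter: an instance is treated as
# class noise when the fused cross-validated predictions of the three detectors
# disagree with its recorded label. DecisionTreeClassifier stands in for C4.5;
# the majority-vote rule and iteration limit are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fusion_noise_filter(X, y, max_iterations=5, cv=5):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    detectors = [
        DecisionTreeClassifier(random_state=0),   # C4.5-style decision tree
        KNeighborsClassifier(n_neighbors=3),      # 3-NN
        LogisticRegression(max_iter=1000),        # Logistic Regression
    ]
    for _ in range(max_iterations):
        # Each detector votes on every instance via cross-validated predictions.
        votes = np.stack([cross_val_predict(d, X, y, cv=cv) != y for d in detectors])
        noisy = votes.sum(axis=0) >= 2            # majority of detectors disagree
        if not noisy.any():
            break                                 # iterative schema has converged
        X, y = X[~noisy], y[~noisy]               # discard the flagged instances
    return X, y
```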
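Two steps of the hybrid data reduction phase, the per-feature outlier score and the neighborhood-based instance selection, could look roughly like the sketch below. The abstract does not give the exact score formula, weighting scheme, or removal criterion, so the 3-sigma scaling (so that a score of one marks a point three standard deviations from the feature mean), the uniform neighbor weights, and the 0.5 probability cut-off are illustrative assumptions; the CFS feature selection step is omitted here.

```python
# Sketch of two hybrid data reduction steps: a per-feature outlier score based on
# distance from the feature mean, and instance selection based on a neighborhood
# class-probability estimate. Scaling, weights, and thresholds are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_outliers(X, y, n_sigma=3.0):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Score each value by its deviation from the feature mean, scaled so that a
    # score >= 1 corresponds to lying more than n_sigma standard deviations away.
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-12
    scores = np.abs(X - mean) / (n_sigma * std)
    keep = (scores < 1.0).all(axis=1)             # drop rows flagged in any feature
    return X[keep], y[keep]

def select_instances(X, y, k=5, threshold=0.5):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Estimate P(label | instance) as the (uniformly weighted) fraction of the
    # instance's k nearest neighbors sharing its label; keep well-supported rows.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    same_label = (y[idx[:, 1:]] == y[:, None])    # exclude the instance itself
    prob = same_label.mean(axis=1)
    keep = prob >= threshold
    return X[keep], y[keep]
```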
Keywords/Search Tags: Software Fault Prediction, Machine Learning, Software Metrics, Feature Selection, Data Balancing, Noise Filtering