Research On Key Technologies Of Data Preprocessing And The Application In Software Fault Prediction

Posted on: 2018-07-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W S Liu
Full Text: PDF
GTID: 1318330542967892
Subject: Computer Science and Technology
Abstract/Summary:
During the software development life cycle, the later the inherent faults in a software project under test are detected, the higher the cost of repairing them. However, testing all software modules consumes a large amount of manpower and resources, so project managers hope to identify likely faulty modules in advance and allocate sufficient test resources to them. Software fault prediction is a feasible way to solve this problem. The general process is to extract modules by mining historical software repositories and to design corresponding metrics (i.e., features). Then, by searching for relevant information in the bug tracking system, the modules are labeled as faulty or non-faulty, which yields a software fault prediction dataset. A fault prediction model based on a machine learning algorithm is built on this dataset. Finally, new modules are classified by the model as faulty or non-faulty.

Software fault prediction is a hot research area in the field of mining software engineering data. Although researchers in China and abroad have made progress, many problems remain to be solved. Specifically: (1) when measuring software modules, a large number of metrics leads to the curse of dimensionality: the metrics are redundant to varying degrees, and some contribute nothing to building the fault prediction model; (2) when collecting the fault prediction dataset, a small mistake in measuring metrics or an incorrect record in labeling modules introduces data noise, which can lead empirical research to wrong conclusions; (3) according to past software development experience, the number of faulty modules is much smaller than the number of non-faulty modules. This uneven distribution results in class imbalance, which may cause some faulty modules to be predicted as non-faulty and can therefore cause substantial losses for the enterprise.

For the three issues above, this thesis puts forward corresponding solutions. The main contributions can be summarized as follows:

(1) For the curse of dimensionality issue, this thesis proposes a cluster analysis based feature selection method (CAFeS). Through feature clustering and feature ranking, this method effectively removes the redundant features that are widespread in the original feature set while preserving the features that are highly correlated with the label. In particular, the original features are first clustered by the K-Medoids clustering method based on a feature-feature correlation measure. Then, within each cluster, the features are ranked based on a feature-class relevance measure, and a given number of features are chosen. In the empirical studies, we choose symmetric uncertainty as the feature-feature correlation measure, and information gain, chi-square, or ReliefF as the feature-class relevance measure. Based on real-world projects such as Eclipse and NASA, we focus on the prediction performance after using CAFeS, and analyze the redundancy rate and selection proportion of the selected feature subset. The results show that CAFeS can effectively alleviate the curse of dimensionality and improve software fault prediction performance.

(2) For the data noise issue, this thesis proposes a noise tolerable feature selection method (NtFCS) to relieve the impact of noise in the dataset. This method extends CAFeS, and the main modifications are in the feature selection stage. Unlike CAFeS, NtFCS selects only the single most typical feature from each cluster instead of several, which reduces the probability that a noisy feature is selected. We also propose three heuristic feature selection strategies based on either the feature-feature correlation measure or the feature-class relevance measure. The former may be sensitive to class noise and the latter may be sensitive to feature noise, so the proposed strategies can compensate for each other. In the empirical studies, we choose Eclipse and NASA as test subjects. We first perform a set of data preprocessing steps to improve the quality of these datasets, and then inject class noise and feature noise simultaneously to imitate real noisy datasets. The experimental results confirm that NtFCS can effectively alleviate the data noise problem. By analyzing the effects of varying the percentage of selected features, the noise injection rate, and the noise type, we also provide guidelines for using NtFCS.

(3) For the class imbalance issue, this thesis proposes a two-stage data preprocessing method that incorporates both feature selection and instance reduction. Concretely, in the feature selection stage, unlike CAFeS, this two-stage method first performs relevance analysis and then conducts redundancy control. The purpose of reversing the order of feature clustering and feature ranking is to remove redundant features while maximizing the correlation between the features and the class. Specifically, we propose a threshold-based clustering method (NTC), which avoids the limitation of the pre-defined number of clusters in K-Medoids. In the instance reduction stage, we apply random under-sampling to balance the faulty and non-faulty instances. In the empirical studies, Eclipse and NASA are again chosen as the experiment datasets. We compare our method with several classical baseline methods and further investigate the factors that influence it. The results demonstrate the effectiveness of NTC and show that combining it with instance reduction yields a further improvement.
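The cluster-then-rank idea described for CAFeS in contribution (1) can be illustrated with a small sketch. This is not the thesis's implementation. It assumes that features are already discretized, uses symmetric uncertainty as the feature-feature measure and information gain as the feature-class measure, and seeds K-Medoids with the first k features purely for determinism. All function names and the toy metric names (`loc_high`, `loc_copy`, `churn`, `churn_copy`) are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 0.0 if denom == 0 else 2.0 * information_gain(x, y) / denom

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Assumption for this sketch: the first k items are the initial medoids.
    """
    medoids = list(range(k))
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in range(len(dist)):
            # A medoid always stays in its own cluster, so no cluster is empty.
            nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # The new medoid of a cluster minimizes the total distance to its members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return list(clusters.values())

def cluster_then_rank(features, labels, k, per_cluster=1):
    """Cluster features by redundancy, then keep the most relevant per cluster."""
    names = list(features)
    dist = [
        [1.0 - symmetric_uncertainty(features[a], features[b]) for b in names]
        for a in names
    ]
    selected = []
    for members in k_medoids(dist, k):
        ranked = sorted(
            members,
            key=lambda i: information_gain(features[names[i]], labels),
            reverse=True,
        )
        selected.extend(names[i] for i in ranked[:per_cluster])
    return selected

# Toy data: two redundant pairs of binary metrics over eight modules.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "loc_high":   [1, 1, 1, 0, 0, 0, 0, 0],
    "loc_copy":   [1, 1, 1, 0, 0, 0, 0, 1],
    "churn":      [0, 1, 0, 1, 0, 1, 0, 1],
    "churn_copy": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = cluster_then_rank(features, labels, k=2, per_cluster=1)
print(selected)
```

On this toy data, the two redundant pairs end up in separate clusters and one representative is kept from each. Setting `per_cluster=1` corresponds loosely to the one-feature-per-cluster choice described for NtFCS, although the thesis's three heuristic selection strategies are not reproduced here.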
Keywords/Search Tags:Software Fault Prediction, Curse of Dimensionality, Data Noise, Class Imbalance