Research On High-dimensional Data Processing In Software Defect Prediction

Posted on: 2021-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
Full Text: PDF
GTID: 2428330611988267
Subject: Computer Science and Technology
Abstract/Summary:
The scale and complexity of software keep growing, which makes software reliability a major concern. In software engineering, identifying the modules of a system that are likely to contain defects, and how those modules are distributed, helps developers allocate resources rationally and improve software quality. Software defect prediction (SDP) uses historical data, such as previously discovered defects, together with software metrics to predict which modules are defect-prone. Reasonable defect prediction helps testers locate and fix defects quickly, which can significantly reduce development cost and improve software trustworthiness.

Current research usually formalizes defect prediction as a machine learning problem, and many machine learning techniques have been applied to it. However, existing defect prediction methods still have many problems in practice. For example, their performance is not stable enough. On high-dimensional data (data with many redundant and irrelevant metrics), prediction accuracy is low, and such data is very common in practice. In addition, defective modules (the "positive class") are usually far fewer than non-defective modules (the "negative class"), so historical defect data is class-imbalanced. This tends to bias the prediction model toward the negative class and lowers prediction accuracy on the positive class. Because a single classifier has limited classification ability, it cannot handle imbalanced data effectively, so many researchers use ensemble learning methods for defect prediction.

This thesis systematically studies the problems of high dimensionality and class imbalance in software defect prediction. First, to deal with high-dimensional and imbalanced data in defect prediction, we conduct a comparative study of how existing oversampling methods and feature selection methods perform in defect prediction. Second, we introduce the concepts of rough set theory and knowledge granularity into feature selection, propose a new information entropy model, harmonic granularity decision entropy, and construct a feature selection algorithm, FSHGE, based on it. Third, to address the limited classification ability and poor defect prediction performance of single classifiers, we propose a multi-modal selective ensemble learning algorithm, SE_RSFS, and apply it to defect prediction. SE_RSFS uses the proposed FSHGE algorithm and a resampling technique to perturb the attribute space and the sample space of the training set simultaneously, thereby achieving an efficient multi-modal perturbation.
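The idea of perturbing the sample space and the attribute space together can be illustrated with a minimal Python sketch. It is not the thesis's SE_RSFS or FSHGE: the abstract does not specify either algorithm, so a random feature subset stands in for FSHGE, class-balanced bootstrap sampling stands in for the resampling step, a nearest-centroid rule stands in for the base classifier, and the selective step (keeping only some ensemble members) is omitted. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def balanced_resample(X, y, rng):
    """Sample-space perturbation: bootstrap every class up to the majority size."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    Xs, ys = [], []
    for label, rows in by_class.items():
        for _ in range(target):
            Xs.append(rng.choice(rows))  # sampling with replacement
            ys.append(label)
    return Xs, ys

def random_feature_subset(n_features, k, rng):
    """Attribute-space perturbation. ASSUMPTION: random stand-in for FSHGE."""
    return sorted(rng.sample(range(n_features), k))

def fit_centroids(X, y, feats):
    """Base learner (assumed): per-class mean over the selected features."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict_centroid(centroids, feats, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((row[f] - c[j]) ** 2 for j, f in enumerate(feats))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def train_ensemble(X, y, n_members=15, k=2, seed=0):
    """Each member sees its own resampled rows and its own feature subset."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        Xs, ys = balanced_resample(X, y, rng)
        feats = random_feature_subset(len(X[0]), k, rng)
        members.append((feats, fit_centroids(Xs, ys, feats)))
    return members

def predict_ensemble(members, row):
    """Majority vote over all members (no selective pruning in this sketch)."""
    votes = Counter(predict_centroid(c, f, row) for f, c in members)
    return votes.most_common(1)[0][0]

# Toy imbalanced data: 20 non-defective modules (label 0), 3 defective (label 1).
X = [[i % 3, (i * 2) % 3, i % 2, (i + 1) % 3] for i in range(20)]
X += [[10, 11, 9, 10], [11, 10, 10, 9], [9, 9, 11, 11]]
y = [0] * 20 + [1] * 3

members = train_ensemble(X, y)
print(predict_ensemble(members, [10, 10, 10, 10]))
print(predict_ensemble(members, [1, 1, 1, 1]))
```

Because every member is trained on a class-balanced sample, no single member is dominated by the negative class, and the differing feature subsets give the vote some diversity. A selective ensemble would additionally score the members on held-out data and keep only a subset before voting.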
Keywords/Search Tags: software defect prediction, SDP, feature selection, ensemble learning, rough sets, class imbalance, harmonic granularity decision entropy