
Research And System Construction Of Data Preprocessing Mechanism In Software Defect Prediction

Posted on: 2020-11-12 | Degree: Master | Type: Thesis
Country: China | Candidate: W Chen | Full Text: PDF
GTID: 2428330590996024 | Subject: Computer technology
Abstract/Summary:
With the rapid development of computer technology, software is widely used in all walks of life. Defects are inevitably introduced during software development and maintenance. By analyzing historical defect data and building a defect model, software defect prediction can identify software entities that are likely to contain latent defects. Several problems in software defect prediction remain to be solved, such as imbalanced data and the curse of dimensionality. Current algorithms for processing imbalanced data are generally based on K-Nearest Neighbor, which is computationally intensive and susceptible to noisy data, and traditional feature selection algorithms cannot effectively remove most irrelevant and redundant features. This thesis studies imbalanced data processing and feature selection methods for software defect prediction in depth. The main research work is as follows:

(1) The current imbalanced data processing algorithms in software defect prediction are summarized, and their advantages and disadvantages are analyzed. Density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when handling samples near the borderline, so the DBSCAN algorithm is optimized for this problem to make the clustering more reasonable. Combining the optimized DBSCAN with SMOTE, this thesis proposes a synthetic minority over-sampling technique based on density clustering. First, the optimized DBSCAN divides the minority-class samples into three groups: noise samples, core samples, and borderline samples. Next, the minority-class noise samples are removed. Finally, the algorithm applies different strategies to over-sample the core samples and the borderline samples. In the empirical study, the algorithm is compared with several classical oversampling algorithms on the NASA software defect dataset. The results show that the algorithm can effectively address data imbalance in software defect prediction.

(2) The current feature selection algorithms in software defect prediction are summarized, and their advantages and disadvantages are analyzed. Because traditional feature selection algorithms cannot effectively remove most irrelevant and redundant features, a cluster-based feature selection algorithm is proposed. First, the algorithm uses ReliefF to calculate the relevance between each feature and the target class and sorts the features by relevance to remove the irrelevant ones. Next, it clusters the remaining features according to the correlation between them. Finally, it selects a representative feature from each cluster. Because the algorithm considers both the correlation between features and the relevance between each feature and the target class, it can effectively remove redundant and irrelevant features. In the empirical study, the method is compared with classical feature selection algorithms on NASA software defect prediction datasets. The results show that the algorithm can effectively address the curse of dimensionality in software defect prediction.

(3) Based on the synthetic minority over-sampling algorithm and the cluster-based feature selection algorithm above, this thesis designs and builds a data preprocessing system for software defect prediction. The client includes an upload module, an oversampling module, a feature selection module, and an algorithm comparison module. The server includes a data analysis module, a system algorithm module, and an algorithm comparison module. The system can oversample software defect prediction datasets, select features from them, compare different algorithms, and display the prediction results clearly and accurately, which helps reduce the time and cost of software development and testing.
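The three oversampling steps in item (1) can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not describe the DBSCAN optimization or the two over-sampling strategies, so the sketch assumes the standard DBSCAN definitions of core, borderline, and noise samples, and it assumes (purely for illustration) that core samples are interpolated toward a random core neighbor while borderline samples are pulled at least halfway toward a core neighbor. The names `classify_minority`, `density_smote`, `eps`, `min_pts`, and `n_new` are hypothetical.

```python
import math
import random


def classify_minority(points, eps, min_pts):
    """Label each minority sample 'core', 'border', or 'noise' (standard DBSCAN roles)."""
    neighbors = [
        [j for j, q in enumerate(points) if j != i and math.dist(p, q) <= eps]
        for i, p in enumerate(points)
    ]
    # A core sample has at least min_pts samples (itself included) within eps.
    core = {i for i, nb in enumerate(neighbors) if len(nb) + 1 >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core sample
        else:
            labels.append("noise")
    return labels, neighbors


def density_smote(points, eps, min_pts, n_new, seed=0):
    """Drop noise samples, then create n_new synthetic minority samples."""
    rng = random.Random(seed)
    labels, neighbors = classify_minority(points, eps, min_pts)
    kept = [i for i, label in enumerate(labels) if label != "noise"]
    if not kept:
        return [], labels
    synthetic = []
    while len(synthetic) < n_new:
        i = rng.choice(kept)
        # Every borderline sample has a core neighbor; an isolated core
        # sample (possible when min_pts is 1) falls back to itself.
        partners = [j for j in neighbors[i] if labels[j] == "core"] or [i]
        j = rng.choice(partners)
        if labels[i] == "core":
            t = rng.random()  # assumed strategy: anywhere on the segment
        else:
            t = 0.5 + 0.5 * rng.random()  # assumed strategy: nearer the core sample
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(points[i], points[j]))
        )
    return synthetic, labels
```

For example, `density_smote(minority_points, eps=1.5, min_pts=4, n_new=20)` returns the synthetic samples together with the role assigned to each original sample, so the removed noise samples can be inspected.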
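The feature selection steps in item (2) can be sketched in the same spirit. The sketch makes several assumptions that the abstract does not confirm: the relevance scores are passed in as a precomputed dictionary (in the thesis they come from ReliefF, which is not reimplemented here), correlation between features is measured with the absolute Pearson coefficient, features are grouped by a simple greedy threshold rule, and the representative of each cluster is its most relevant feature. The names `select_features`, `min_relevance`, and `corr_threshold` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def select_features(columns, relevance, min_relevance, corr_threshold):
    """Filter irrelevant features, cluster the rest, and keep one per cluster.

    columns:   dict mapping feature name -> list of values
    relevance: dict mapping feature name -> relevance score
               (assumed to be precomputed, e.g. by ReliefF)
    """
    # Step 1: drop irrelevant features and rank the rest by relevance.
    ranked = sorted(
        (f for f in columns if relevance[f] >= min_relevance),
        key=lambda f: relevance[f],
        reverse=True,
    )
    # Steps 2 and 3: greedy grouping. Features are visited from most to least
    # relevant, so the first member of each cluster is its most relevant
    # feature and serves as the representative.
    clusters = {}
    for f in ranked:
        for rep in clusters:
            if abs(pearson(columns[f], columns[rep])) >= corr_threshold:
                clusters[rep].append(f)  # redundant with an existing representative
                break
        else:
            clusters[f] = [f]  # starts a new cluster
    return list(clusters), clusters
```

The greedy rule is chosen only because it is short; any clustering over a feature correlation matrix could take its place without changing the overall shape of the pipeline.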
Keywords/Search Tags:software defect prediction, data preprocessing, imbalanced data, oversampling, feature selection