Research On Machine Learning Based Software Defect Prediction

Posted on: 2019-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z W Zhang
Full Text: PDF
GTID: 1368330590496080
Subject: Information security
Abstract/Summary:
With the development of information technology, the challenges of information security are becoming increasingly serious, and software security in particular is attracting more and more attention. As one of the main sources of software security risk, software defects not only threaten the security and stability of computer information systems but may also be exploited by hackers, leading to malicious intrusions. Software defect prediction is an important research topic in software engineering. It is closely related to software quality assurance, and it can improve the quality of software systems and optimize the allocation of testing resources. In software engineering data mining, machine learning based static software defect prediction draws on the data in historical software repositories: it employs defect metrics to analyze software code or the development process, and uses machine learning methods to predict the defect-proneness of, or the number of defects in, the software modules of a project. The key factors influencing the performance of defect prediction are the design of the metrics, the construction of the defect prediction model, and the related processing of the defect prediction datasets. From the perspective of machine learning, this dissertation carries out systematic research on methods of software defect prediction. The main research achievements are summarized as follows:

(1) Collaborative representation based feature selection and classification of software defect data

A filter-type feature selection method for software defect data based on the collaborative representation score (CRS) and a software defect prediction classification method based on collaborative representation (CSDP) are proposed. To eliminate the redundancy between metrics in software defect datasets and improve the computational efficiency of feature selection, CRS uses collaborative representation and an l2 graph to construct the adjacency relations and weights of the defect data graph, taking into account both the collaborative representation preserving capability and the data variance. CRS sorts the collaborative representation scores and iteratively selects the data features. In CSDP, Laplace score sampling is first performed on the defect-free training data to construct a class-balanced training dataset, and then the projection matrix of the query sample is obtained by using collaborative representation. Finally, the software defect predictor is built using collaborative representation classification with regularized least squares. CSDP uses l2-norm regularization in place of l1-norm regularization, which further enhances the efficiency of sparse classification and reduces the computational complexity. The collaborative representation classification in CSDP uses not only the discriminative ability of the class-specific representation residuals but also the discriminative information in the l2-norm "sparse term", so the classification performance is improved. Experimental results on widely used software defect prediction datasets demonstrate the effectiveness of the two methods.

(2) Software defect prediction based on dictionary learning

Predicting defect-proneness can be regarded as a binary classification problem. To address the cost sensitivity and class imbalance of software defect prediction, as well as the incremental learning and semi-supervised scenarios, three dictionary learning based software defect prediction methods are proposed: cost-sensitive discriminative dictionary learning (CDDL), class-specific incremental dictionary learning (CIDL), and twice learning based semi-supervised dictionary learning (TLSDL). When constructing the initial dictionary atoms, CDDL solves the class-imbalance problem using PCA. Exploiting the similarity among software modules, CDDL designs a discriminative dictionary. Considering the risk cost of software defect prediction, cost-sensitive learning is introduced into the discriminative dictionary learning based defect prediction model. To address the high computational cost of traditional dictionary learning methods on large datasets, CIDL is designed as an incremental dictionary learning method. CIDL performs class-specific dictionary learning on the initial training set, and the learned supervised dictionaries are helpful for classification and prediction. By using the principle of maximum mutual information to select the incremental dictionary atoms from the incremental set, CIDL makes full use of the complementary information in the incremental data. To construct an effective defect dictionary learning model in the semi-supervised scenario, where labeled samples are few and unlabeled samples are abundant, TLSDL uses the twice learning framework. In the first stage of learning, a large number of unlabeled samples are added to the labeled training set by assigning them probabilistic soft labels. In the second stage of learning, discriminative dictionary learning is used to improve the performance of sparse representation classification and prediction. Experimental results demonstrate the effectiveness of the three methods.

(3) Semi-supervised software defect prediction based on graph learning

To fully represent the potential clustering relationships among software defect data, a sparse, highly discriminative, adaptive-neighborhood informative graph is proposed, namely the non-negative sparse graph (NSG). In the sparse graph learning process, NSG adds non-negativity constraints, and the connections and weights of the nodes on the graph are obtained simultaneously through non-negative sparse coding. Based on NSG, two semi-supervised software defect prediction methods are proposed: NSG based co-training (NSGCT) and NSG based label propagation (NSGLP). NSGCT combines the advantages of the graph-based method and the disagreement-based method. It explicitly estimates the confidence of unlabeled data to reduce the introduction of noisy data, and improves the performance of the semi-supervised co-training algorithm. NSGLP uses the Laplace score sampling technique to handle class imbalance and uses NSG to represent the relationships among defect data. On NSG, a label propagation algorithm is used to iteratively predict the labels of unlabeled software modules. NSGLP improves the performance of semi-supervised defect prediction through imbalance processing and the construction of informative graphs. Experimental results demonstrate the effectiveness of the two graph-based semi-supervised methods.
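
As an illustration of the classification step that the abstract attributes to CSDP, the following is a minimal sketch of generic collaborative representation classification with regularized least squares. It is not the dissertation's implementation: the Laplace score sampling step is omitted, the function names (`solve`, `dot`, `crc_rls_predict`), the regularization weight `lam`, and the toy data are all assumptions made for this example, and the metric values are assumed to be already scaled. The query is coded over all training samples with an l2 penalty, and each class is then scored by its reconstruction residual divided by the l2 norm of its coefficients.

```python
from math import sqrt


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def crc_rls_predict(train, labels, query, lam=0.01):
    """Classify `query` by collaborative representation (sketch).

    train  : list of training modules, each a list of metric values
    labels : class of each training module (e.g. 0 = clean, 1 = defective)
    lam    : l2 regularization weight (illustrative default, an assumption)
    """
    n = len(train)
    # Ridge system (X^T X + lam * I) rho = X^T y, where the columns of X
    # are the training modules and y is the query.
    gram = [[dot(train[i], train[j]) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [dot(x, query) for x in train]
    rho = solve(gram, rhs)  # one coefficient per training module

    best_label, best_score = None, float("inf")
    for c in sorted(set(labels)):
        idx = [i for i in range(n) if labels[i] == c]
        # Reconstruct the query from the class-c modules only.
        recon = [sum(rho[i] * train[i][d] for i in idx)
                 for d in range(len(query))]
        residual = sqrt(sum((q - r) ** 2 for q, r in zip(query, recon)))
        coef_norm = sqrt(sum(rho[i] ** 2 for i in idx))
        # Small residual and large coefficient norm both favor class c.
        score = residual / coef_norm if coef_norm > 0 else float("inf")
        if score < best_score:
            best_label, best_score = c, score
    return best_label


# Toy usage with made-up two-metric modules (not data from the dissertation).
train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]
print(crc_rls_predict(train, labels, [1.0, 0.0]))
```

Because the l2-regularized coding problem has a closed-form solution, a single linear solve replaces the iterative optimization that l1-regularized sparse coding would need, which is the efficiency argument the abstract makes. A practical version would likely use a numerical library for the linear solve and precompute the projection once for all queries.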
Keywords/Search Tags: software defect prediction, dictionary learning, semi-supervised learning, feature selection, collaborative representation