Research On Heterogeneous Software Defect Prediction

Posted on: 2019-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Q Li
GTID: 1368330545499890
Subject: Computer application technology
Abstract/Summary:
In recent years, software defect prediction (SDP) has attracted considerable research interest in the software engineering community, and new techniques and methods based on machine learning have become the focus of that research. SDP builds a prediction model from historical defect data to predict potential defects in software modules. It helps optimize the allocation of testing resources and improve software quality. Recently, researchers have proposed heterogeneous defect prediction (HDP), which predicts the defect-proneness of software modules in a target project using heterogeneous metric data from other source projects. HDP handles the case where the source and target projects have different metric sets, that is, where the metric types or the metric set sizes differ. Although existing HDP methods have tackled a number of difficulties and achieved interesting prediction results, many practical problems in HDP remain to be studied. This dissertation focuses on HDP and addresses the following problems with new techniques and methods based on machine learning.

(1) Defect data consists of different types of metrics. These metrics usually have different physical meanings and distributions, so defect data usually lies in a nonlinear feature space. Defect data therefore suffers from the linear inseparability problem.

(2) Defect data is usually highly imbalanced. That is, the data contains many more non-defective modules (the majority class) than defective ones (the minority class). The imbalanced distribution can cause modules in the minority class to be misclassified, which is an important factor behind unsatisfactory prediction performance.

(3) Because the source and target data have different software metrics, they are usually heterogeneous. However, how to use the source data and a small amount of labeled target data (i.e., mixed project data) simultaneously to learn a defect predictor in HDP has not been studied.

(4) Existing HDP methods mainly focus on predicting software modules in a target project based on metric data collected from a single source project. Multiple source projects can generally provide more information than a single project, so intuitively, predicting with multiple sources may yield better performance. It is therefore meaningful to investigate whether and how the performance of HDP can be improved by employing multiple source projects.

(5) A precondition for conducting HDP experiments with multiple source projects is that these projects can be obtained from other companies. In practice, most companies are unwilling to share their data because of privacy concerns. To facilitate data sharing, protecting the privacy of data owners before they release their data is an important and urgent research problem.

In this dissertation, we studied these problems in HDP, namely linear inseparability, class imbalance, mixed project data, multiple sources, and privacy preservation, and obtained the following results with the proposed HDP methods.

(1) To solve the linear inseparability and class imbalance problems simultaneously, we propose a cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP. For the linear inseparability problem, CTKCCA maps the source and target data into a common nonlinear feature space using the transfer kernel canonical correlation analysis technique. In this space, the source and target data distributions are much more similar, and defective and non-defective modules can be better separated. For the class imbalance problem, CTKCCA uses cost-sensitive learning to assign different misclassification costs to the defective and non-defective modules of the source project in the transfer learning stage. By combining kernel CCA and cost-sensitive learning effectively, CTKCCA can improve the performance of the prediction model. Extensive experimental results on 28 projects show the effectiveness of the proposed CTKCCA approach.

(2) To solve the mixed project data and class imbalance problems simultaneously, we propose a cost-sensitive label-and-structure-consistent unilateral projection (CLSUP) approach for HDP. To use the source project data and a small amount of labeled data in the target project effectively, CLSUP uses a domain adaptation learning technique to transform the source data into the target subspace, where the data distributions of the source and target projects become similar and the structure of the source data is maintained. To mitigate the influence of class imbalance, CLSUP uses cost-sensitive learning to apply different misclassification costs to defective and non-defective modules in the domain adaptation learning stage. Extensive experimental results on 30 projects show the effectiveness of the CLSUP approach.

(3) To solve the privacy preservation and multiple sources problems simultaneously, we propose a multi-source and privacy preservation based heterogeneous defect prediction framework. To protect the privacy of the source data, we design a sparse representation based double obfuscation (SRDO) algorithm. For a given module, SRDO uses a sparse representation based nearest neighbor selector to select one defective and one non-defective module as disturbances for obfuscating the current module. To make effective use of multiple data sources, we develop, on top of the obfuscated data, a multi-source selection based manifold discriminant alignment (MSMDA) approach for HDP. For a given target project, MSMDA can incrementally select distribution-similar source projects from the many available sources. The selected projects are then used to carry out multi-source heterogeneous defect prediction. Extensive experimental results on 28 projects show the effectiveness of the privatization algorithm SRDO and the multi-source heterogeneous defect prediction approach MSMDA.
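As a rough illustration of the cost-sensitive idea mentioned for CTKCCA and CLSUP, the sketch below derives unequal misclassification costs from the class imbalance and applies a minimum-expected-cost decision rule. It is not the dissertation's method: the abstract says the costs are applied inside the transfer learning or domain adaptation stage, whereas this sketch applies them only at decision time. The function names `imbalance_costs` and `cost_sensitive_predict`, the label encoding (1 = defective, 0 = non-defective), and the choice of the majority-to-minority ratio as the false-negative cost are all assumptions made for the example.

```python
from collections import Counter


def imbalance_costs(labels):
    """Return (cost_false_positive, cost_false_negative) from class counts.

    Assumption: missing a defective module (false negative) costs as much as
    the majority/minority ratio; the dissertation may set costs differently.
    """
    counts = Counter(labels)
    n_defective = counts[1]
    n_clean = counts[0]
    if n_defective == 0 or n_clean == 0:
        raise ValueError("both classes must be present")
    return 1.0, n_clean / n_defective


def cost_sensitive_predict(prob_defective, cost_fp, cost_fn):
    """Flag a module as defective when the expected cost of missing it
    exceeds the expected cost of a false alarm:
        p * cost_fn > (1 - p) * cost_fp
    which is equivalent to p > cost_fp / (cost_fp + cost_fn).
    """
    threshold = cost_fp / (cost_fp + cost_fn)
    return [1 if p > threshold else 0 for p in prob_defective]


# Hypothetical data: 8 non-defective and 2 defective modules in the source.
labels = [0] * 8 + [1] * 2
cost_fp, cost_fn = imbalance_costs(labels)

# Hypothetical predicted defect probabilities for four target modules.
probs = [0.1, 0.25, 0.4, 0.6]
print(cost_sensitive_predict(probs, cost_fp, cost_fn))
```

With a higher false-negative cost the decision threshold drops below 0.5, so borderline modules are more likely to be flagged as defective, which is the intended effect when defective modules are the minority class.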
Keywords/Search Tags: Heterogeneous Defect Prediction, Linearly Inseparable, Class Imbalance, Mixed Projects, Privacy Preservation, Multiple Sources, Machine Learning