
Research On Cross-Project Software Defect Prediction

Posted on: 2021-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Ni
GTID: 1368330647450640
Subject: Computer Science and Technology
Abstract/Summary:
Software defect prediction has become a hot research topic in software engineering. It aims to help prioritize the allocation of limited testing resources by mining a project's historical data, extracting features related to software defects, building an effective defect prediction model, and classifying new software modules as defective or not. Building a defect prediction model with good performance generally requires enough labeled historical data. In practice, however, collecting and labeling enough training data is difficult for two reasons: (1) newly developed projects usually have only a small amount of labeled historical data; (2) labeling historical data requires substantial manpower and material resources. Cross-project defect prediction (CPDP) was proposed in response. It uses historical data from other projects (i.e., source projects) to build a defect prediction model, then predicts whether software modules in the current project (i.e., the target project) contain defects.

Although researchers in China and abroad have proposed a variety of solutions for the CPDP scenario, several problems still need to be solved urgently. In particular: (1) in the CPDP scenario, source project data and target project data come from different distributions, so a defect prediction model built on the source projects might not generalize well to the target project; (2) CPDP assumes that the source projects have plenty of labeled data while the target project does not, yet in practice only limited labeled data might be available from both the source and target projects; (3) current research pays more attention to improving defect prediction performance than to the impact of a CPDP model on software developers in practical use. This dissertation proposes three solutions, one for each of these key problems. The work is summarized as follows:

(1) To address the distribution difference between source projects and the target project, this dissertation proposes FeSCH (Feature Selection Using Clusters of Hybrid-Data), a cluster-based method that alleviates the distribution difference through feature selection. FeSCH includes two phases: a feature clustering phase and a feature selection phase. The feature clustering phase clusters features using a density-based clustering method, and the feature selection phase selects features from each cluster using a ranking strategy. For the second phase, FeSCH designs three heuristic ranking strategies: local density of features, similarity of feature distributions, and feature-class relevance. To verify the effectiveness of FeSCH, it is compared with benchmark methods on widely used open-source software projects. The experimental results show that FeSCH can effectively reduce the distribution difference between the source project and the target project and improve the performance of defect prediction.

(2) To address the case where both the source projects and the target project have limited labeled data, this dissertation proposes MASK, a multi-task defect prediction method. MASK extracts the common knowledge (i.e., shared information) among related projects and combines it with personalized information (i.e., non-shared information) to build defect prediction models for all related projects simultaneously. In particular, MASK consists of two phases: a differential evolution optimization phase and a multi-task learning phase. The former aims to find optimal weights for the shared and non-shared information in related projects (i.e., the target project and its related source projects), while the latter builds prediction models for each project simultaneously. To verify the effectiveness of MASK, it is compared with benchmark methods on widely used open-source software projects. The experimental results show that MASK can effectively extract appropriate shared information among related projects and combine it with the non-shared information to improve the performance of defect prediction.

(3) To measure the impact of a defect prediction model on developers in the CPDP scenario, this dissertation proposes a set of effort-aware performance measures and an effort-aware supervised CPDP method (EASC). EASC takes both effort-aware and non-effort-aware performance measures into consideration and uses an appropriate strategy for defect prediction in each scenario. In particular, when instances are inspected without considering inspection effort, a larger instance (e.g., one with more lines of code) should be considered first; when instances are inspected with inspection effort taken into account, an instance with a larger ratio between its defect proneness (i.e., the probability output by a classifier) and its inspection effort (i.e., LOC) should be considered first. To verify the effectiveness of EASC, it is compared with benchmark methods on widely used open-source software projects. The experimental results show that EASC achieves better performance both with and without considering inspection effort. We hence recommend that future studies include EASC as a baseline method for comparison when developing new CPDP methods.
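
The two inspection orders described for EASC in item (3) can be illustrated with a minimal Python sketch. It covers only the ordering rule stated in the abstract, not the full method. The function name rank_for_inspection, its parameters, and the sample values are hypothetical, and treating a LOC of zero as one to avoid division by zero is an assumption of this sketch, not something the abstract specifies.

```python
from typing import List, Sequence


def rank_for_inspection(
    defect_prob: Sequence[float],
    loc: Sequence[int],
    effort_aware: bool,
) -> List[int]:
    """Return instance indices in the order they would be inspected.

    defect_prob: probability of being defective, as output by some classifier.
    loc: lines of code of each instance, used as the inspection effort.
    effort_aware: True to rank by probability / LOC, False to rank by LOC.
    """
    if len(defect_prob) != len(loc):
        raise ValueError("defect_prob and loc must have the same length")

    if effort_aware:
        # Larger ratio of defect proneness to inspection effort goes first.
        # Assumption of this sketch: a LOC of 0 is treated as 1.
        scores = [p / max(n, 1) for p, n in zip(defect_prob, loc)]
    else:
        # Without considering effort, larger instances go first.
        scores = [float(n) for n in loc]

    # sorted() is stable, so ties keep their original relative order.
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical data: three modules with classifier outputs and sizes.
probs = [0.9, 0.6, 0.8]
sizes = [300, 20, 100]

print(rank_for_inspection(probs, sizes, effort_aware=True))   # [1, 2, 0]
print(rank_for_inspection(probs, sizes, effort_aware=False))  # [0, 2, 1]
```

In this example the small module at index 1 moves to the front under the effort-aware order because its ratio of defect probability to LOC is the highest, while the largest module at index 0 is inspected first when effort is ignored.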
Keywords/Search Tags: Cross-project Defect Prediction, Feature Selection, Multi-task Learning, Effort-aware Performance Measures