Research On Cross-project Software Defect Prediction By Transfer Learning

Posted on: 2020-12-22  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S J Qiu  Full Text: PDF
GTID: 1368330620958601  Subject: Computer Science and Technology
Abstract/Summary:
As software applications become more widespread, their diversity and complexity continue to increase, placing higher demands on software quality assurance technologies. Software testing and defect prediction play an important role in the software life cycle. If software defects can be detected automatically during the development and testing stages, the quality assurance team can discover potential issues and understand the defect distribution. Test resources can then be allocated reasonably, which improves software quality and reduces testing cost. In recent years, researchers have explored techniques for software defect prediction, and machine learning methods have achieved some success. However, the problems of software cold start and the scarcity of labeled data remain unresolved. To allow defect prediction to be applied earlier in the software life cycle, cross-project software defect prediction has been proposed as a feasible and effective solution. It aims to transfer defect prediction knowledge from a mature project (the source project, with sufficient labeled data) to a target project with no or very limited labeled data, so that a defect prediction model trained on the source project can be used to detect defects in the target project. However, cross-project software defect prediction still has problems that affect model prediction performance. This thesis focuses on four of them: class imbalance, negative transfer, distribution under-adaptation, and missing transferable semantic features. The class imbalance problem is investigated first. Then, working with class-balanced data, solutions to negative transfer, distribution under-adaptation, and missing transferable semantic features are designed by implementing enhanced transfer learning algorithms. The main research work includes:

(1) Study the class imbalance problem and class imbalance learning methods in cross-project software defect prediction. Most cross-project software defect prediction studies adopt sub-sampling and cost-sensitive learning methods. This thesis broadens the investigation of class imbalance learning methods in order to provide selection suggestions. Large-scale experiments were carried out on 31 open source projects from 5 software defect data repositories. Using a statistical analysis method, we analyze 15 class imbalance learning methods and 37,504 prediction results, and evaluate the effectiveness of each method under different data sets and base classifiers. This provides a basis for handling class imbalance in the rest of the thesis.

(2) Explore the negative transfer problem caused by irrelevant data in cross-project software defect prediction. Most current studies consider only the relationship between cross-project instances and focus on addressing non-independent instances of the target project, while ignoring the fact that different clusters in the source project differ in how much they help the target project. This thesis proposes a method based on multiple cluster weight analysis, which uses a small proportion of labeled within-project data to evaluate how well each cluster in the source project supports the defect prediction task of the target project. By combining the kernel mean matching algorithm with multiple cluster weight learning, the method adjusts the source instance weights and source cluster weights to alleviate the negative impact of irrelevant data on the cross-project software defect prediction task.

(3) Explore how the distribution discrepancy across projects affects the performance of cross-project software defect prediction models. At present, most distribution adaptation methods consider only the marginal probability distribution of the data, and the conditional probability distribution is not fully adapted. As a result, the probability distributions of data from different projects are under-adapted. To solve this problem, a joint distribution matching algorithm is proposed. The algorithm uses the maximum mean discrepancy to measure the distance between joint probability distributions. It aims to improve the knowledge transfer ability of the defect prediction model by simultaneously minimizing the discrepancy of both the marginal and the conditional probability distributions across projects with transductive transfer learning.

(4) Explore the problem of missing transferable semantic features in cross-project software defect prediction. Because data distributions differ between projects, the semantic features that existing research extracts from the source project by deep learning often cannot be applied effectively to the defect prediction task of the target project. To address the lack of transferable semantic features between projects, this thesis proposes a transfer convolutional neural network model. The model parses the program source code into an integer vector that serves as the input of the neural network, and adds a data distribution matching layer to the network. By simultaneously minimizing the classification error, the distribution discrepancy, and a manifold regularization term, the model is able to extract transferable features generated by deep learning and apply them to improve cross-project software defect prediction performance.
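
To make the distance measure in item (3) concrete, the following is a minimal sketch of a biased empirical estimate of the squared maximum mean discrepancy between a source sample and a target sample, plus a simple sum of the marginal term and per-class conditional terms. It is an illustration only and not the thesis's algorithm: the RBF kernel, the `gamma` value, the use of target pseudo-labels for the conditional terms, and all function names are assumptions introduced here.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel values exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(xs, xt, gamma=0.1):
    """Biased estimate of squared MMD between samples xs and xt."""
    return (rbf_kernel(xs, xs, gamma).mean()
            + rbf_kernel(xt, xt, gamma).mean()
            - 2.0 * rbf_kernel(xs, xt, gamma).mean())


def joint_mmd2(xs, ys, xt, yt_pseudo, gamma=0.1):
    """Marginal MMD plus one class-conditional MMD term per source class.

    yt_pseudo holds assumed pseudo-labels for the target instances,
    since the target project is taken to be unlabeled (hypothetical setup).
    """
    total = mmd2(xs, xt, gamma)
    for c in np.unique(ys):
        s_c, t_c = xs[ys == c], xt[yt_pseudo == c]
        if len(s_c) > 0 and len(t_c) > 0:
            total += mmd2(s_c, t_c, gamma)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(0.0, 1.0, size=(200, 5))
    target_near = rng.normal(0.0, 1.0, size=(200, 5))
    target_far = rng.normal(2.0, 1.0, size=(200, 5))
    print("near:", mmd2(source, target_near))
    print("far: ", mmd2(source, target_far))
```

In this sketch a target sample drawn from a shifted distribution yields a larger value than one drawn from the same distribution as the source, which is the quantity a distribution matching method would try to reduce.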
Keywords/Search Tags: Software Defect Prediction, Cross-Project Software Defect Prediction, Machine Learning, Transfer Learning, Maximum Mean Discrepancy