| Owing to the outstanding performance of machine learning in software defect prediction (SDP), machine-learning-based methods have become the mainstream approach to software defect prediction. However, with the rise of cross-project defect prediction (CPDP), traditional supervised learning methods can no longer meet the requirements of the problem. In this setting, researchers need to build a universal model that uses source project data to predict defects in the target project, and the difference between the distributions of the training data and the test data degrades classifier performance. In cross-project software defect prediction, two issues are key: preprocessing the data distribution, and making the classifier's training process transferable. Prediction performance depends directly on extracting an appropriate feature space shared by the source and target projects and on training a good transfer model. Most researchers either attend only to preprocessing or focus only on classifier performance, overlooking how the other can contribute to effective prediction. In addition, traditional machine learning algorithms are simple, shallow-learning models whose expressive and generalization abilities are limited when modeling complex functions. To address the problem of machine-learning-based cross-project software defect prediction, this paper summarizes the existing research methods and studies the following aspects: (1) This paper puts forward a distance-based baseline transformation method (BT). To reduce the difference between the data distributions of the source and target projects, the method first calculates the distances among non-defective samples, then selects one non-defective sample from each data set as that data set's baseline based on the distances between the
non-defective samples. Finally, the method uses a rank function to transform the data. The experimental results show that data after baseline transformation can be used to predict cross-project defects effectively and to reach the performance level of within-project defect prediction. In addition, compared with preprocessing methods of the same type, the baseline transformation method has clear advantages in cross-project defect prediction. (2) This paper presents a comprehensive prediction model that combines a preprocessing method with classifier transfer in order to construct a better classification space and enhance classifier performance. First, the baseline transformation method is used to preprocess the data. Then a genetic algorithm (GA) is selected as the transfer component, with the classification performance of different classifiers used as the fitness measure, and the evolution is validated on the source project and the target project. Finally, ensemble learning (EL) addresses the problems of homogeneous features and limited classifier expressiveness. The experimental results show that the comprehensive prediction model effectively improves the predictive ability of classifiers. It is applicable to datasets of different sizes and can be combined with different classifiers. Compared with other cross-project defect prediction methods, the comprehensive prediction model is generally superior in accuracy and F-measure. (3) This paper proposes a way to find a baseline in the target project and a feature-based implementation of the baseline transformation. On the one hand, according to the number and characteristics of the non-defective samples, a clustering algorithm and a baseline weighted average method are used to handle the case where sample labels in the target project may be unknown. On the other hand, the granularity of the baseline transformation method is analyzed, and applying the baseline transformation at the feature level further reduces the difference in the data
distribution between different projects. The experimental results show that the clustering algorithm and the baseline weighted average method can approximate the baseline that would be calculated if the sample labels in the target project were known. The feature-based baseline transformation is shown to work with different classifiers (SVM, NB, CART). |
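The abstract does not give the formulas behind the baseline transformation, so the following is only a minimal sketch of the general idea in (1), under stated assumptions. It assumes Euclidean distance, assumes the baseline is the non-defective sample with the smallest total distance to the other non-defective samples (a medoid), and assumes the rank function ranks each feature's absolute deviation from the baseline within its own data set. The function names `find_baseline` and `baseline_rank_transform` are hypothetical and are not taken from the paper.

```python
import numpy as np


def find_baseline(X, y):
    """Pick one non-defective sample as the baseline (assumed: the medoid).

    X: (n_samples, n_features) metric matrix; y: labels, 0 = non-defective.
    """
    clean = X[y == 0]
    # Pairwise Euclidean distances among the non-defective samples.
    diffs = clean[:, None, :] - clean[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # The sample with the smallest total distance to the others.
    return clean[dist.sum(axis=1).argmin()]


def baseline_rank_transform(X, baseline):
    """Replace each value by the normalized rank of its deviation from the baseline.

    Assumed form of the rank function: per feature, rank |x - baseline|
    within the data set and scale the ranks to [0, 1].
    """
    deviation = np.abs(X - baseline)
    # Applying argsort twice yields 0-based ordinal ranks per column.
    order = np.argsort(deviation, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    return ranks / max(len(X) - 1, 1)


# Toy data: three non-defective modules and one defective module.
X_source = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]])
y_source = np.array([0, 0, 0, 1])

source_baseline = find_baseline(X_source, y_source)
X_source_bt = baseline_rank_transform(X_source, source_baseline)
```

In this sketch each project is transformed against its own baseline, so the source and target data end up on the same rank scale regardless of their original metric ranges. When target labels are unknown, the baseline would have to be estimated differently, as described in (3).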