With the rapid development of network technology, users have come to demand higher software quality. Software defects are the single biggest factor affecting that quality, so defect detection has become a necessary step before software goes online. However, software systems are usually very large, and inspecting the entire code base is prohibitively expensive. Software defect prediction was proposed to address this problem: by directing limited human and material inspection resources toward the code most likely to be defective, it improves efficiency and reduces cost. This thesis focuses on two problems in machine-learning-based software defect prediction: insufficient feature relevance and prediction without labels. We approach them from several angles with the goal of improving the Area Under the Curve (AUC), which reflects predictive performance on the minority defective class better than accuracy does.

The main results of the thesis are as follows. First, to address class imbalance in software defect prediction, where defective code is usually a small minority compared with normal code, a new oversampling scheme is proposed. The scheme incorporates intra-class dispersion information and a support-vector cleaning strategy so that the newly generated samples are distributed more uniformly. Experiments combining several machine learning methods widely used in software defect prediction with several existing oversampling schemes show that the proposed scheme achieves a higher AUC.

Second, starting from the features themselves, the original features are screened according to how strongly each one influences the final prediction, and a combined feature screening and prediction scheme is proposed. The scheme selects features in both a forward (incremental) direction and a backward (decremental) direction, and removes noisy features to reduce the feature dimension. Experimental results show that the proposed scheme outperforms previously proposed feature reduction prediction methods in both AUC and time complexity.

Finally, many software projects do not have multiple iterated versions and therefore have no labeled data, which calls for unsupervised learning. An automatic labeling scheme based on main features is proposed, in which the grouped features are mapped to low-dimensional feature spaces and then clustered. Experimental verification on multiple datasets shows that the scheme improves AUC.
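
The preference for AUC over accuracy on imbalanced defect data can be illustrated with a small sketch. The data below is entirely hypothetical (100 modules, 5 of them defective) and is not taken from the thesis experiments; the helper names `accuracy` and `auc` are introduced here for illustration only. AUC is computed as the probability that a randomly chosen defective module receives a higher score than a randomly chosen clean one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of modules whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Pairwise (rank-based) AUC: P(score of defective > score of clean), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced data set: 5 defective modules (label 1), 95 clean (label 0).
y_true = [1] * 5 + [0] * 95

# A trivial model that gives every module the same score and labels everything "clean".
trivial_scores = [0.0] * 100
trivial_pred = [0] * 100

acc_trivial = accuracy(y_true, trivial_pred)   # high, because clean modules dominate
auc_trivial = auc(y_true, trivial_scores)      # chance level, no ranking ability

# A model that ranks most defective modules above the clean ones (scores are made up).
ranked_scores = [0.9, 0.8, 0.7, 0.6, 0.2] + [0.3] * 10 + [0.1] * 85
ranked_pred = [1 if s >= 0.5 else 0 for s in ranked_scores]

acc_ranked = accuracy(y_true, ranked_pred)
auc_ranked = auc(y_true, ranked_scores)

print(f"trivial: accuracy={acc_trivial:.3f}, AUC={auc_trivial:.3f}")
print(f"ranked:  accuracy={acc_ranked:.3f}, AUC={auc_ranked:.3f}")
```

In this toy setting the trivial model finds no defects at all yet its accuracy differs from the useful model's by only a few points, while the AUC values separate the two clearly, which is the behaviour the abstract relies on when it adopts AUC as the evaluation index.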