Software defects are the antithesis of software quality and a threat to software security. Residual defects cause increasingly intractable problems as software iterates, and once triggered they can have unpredictable and disastrous consequences. Software defects must therefore be detected and repaired. It has long been established that the earlier a defect is found, the lower the repair cost and the more losses can be avoided. However, as the software industry develops, programs grow larger and structurally more complex, so defects are hidden more deeply and are harder to find. How to detect software defects as early as possible and repair them at lower cost has therefore become an urgent research problem. To this end, scholars have proposed software defect prediction methods based on machine learning. However, these methods struggle with the high dimensionality of defect data, insufficient labeled samples, class imbalance, and overly coarse prediction granularity, all of which seriously limit prediction efficiency and accuracy. To solve these problems, this paper proposes a semi-supervised software defect prediction and localization method. The main contributions and innovations are as follows:

(1) To address the high feature dimensionality of defect data, which degrades the classification accuracy of prediction models, this paper proposes a filter-based feature selection method based on correlation and redundancy. The method has two stages: the first stage computes the correlation of each feature and ranks the features accordingly; the second stage computes redundancy and, together with the correlation ranking from the first stage, selects the optimal feature subset. The innovation of this method is that it combines three feature ranking sequences and assigns each feature a weight, which improves the generalization ability of the prediction model and avoids the instability of a single feature selection method. At the same time, by considering the correlation between features, it effectively eliminates redundant features and reduces the feature dimension. Experimental results show that this method improves the classification accuracy of software defect prediction models more than other filter-based feature selection methods.

(2) To address the insufficient labeled defect samples and class imbalance that make defects difficult to predict in the early stages of software development, this paper proposes a semi-supervised software defect prediction model based on tri-training. First, feature normalization is used to smooth the feature data and eliminate the effect of excessively large or small feature values on the classification performance of the model. Second, oversampling is used to expand the data and resolve the class imbalance of the labeled samples. Finally, the tri-training algorithm is used to learn from the training samples and build the defect prediction model. Experiments on NASA datasets show that the proposed method outperforms four existing supervised and semi-supervised learning methods in accuracy, recall, and F1-score.

(3) To address the problem that the granularity of existing methods is too coarse to locate software defects accurately, this paper proposes a software defect localization method based on defect prediction and code naturalness. First, the source code is tokenized and a code corpus is constructed. Then, an N-gram model is used to compute the cross entropy of every code line in a defective module, and the lines are sorted in descending order. Finally, the lines ranked at the top are considered more likely to be defective. This method predicts first and localizes second: it first predicts which modules in the project are likely to be defective, and then localizes defects at the code-line level within those modules, thereby solving the problem of overly coarse prediction granularity. Experimental results show that the proposed method achieves better localization performance than existing methods.
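
As an illustration of the two-stage idea in contribution (1), the following is a minimal sketch and not the method evaluated in this paper. It assumes absolute Pearson correlation as both the correlation measure (feature versus label) and the redundancy measure (feature versus feature), uses a single ranking rather than the weighted combination of three rankings, and the names `pearson`, `select_features`, and `max_redundancy` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def select_features(columns, labels, max_redundancy=0.9):
    """Two-stage filter selection over a dict {feature_name: column_values}.

    Assumption: |Pearson| stands in for both measures; the threshold is illustrative.
    """
    # Stage 1: rank features by their correlation with the defect label.
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], labels)),
        reverse=True,
    )
    # Stage 2: walk the ranking and keep a feature only if it is not
    # redundant with any feature that has already been kept.
    selected = []
    for name in ranked:
        if all(
            abs(pearson(columns[name], columns[kept])) < max_redundancy
            for kept in selected
        ):
            selected.append(name)
    return selected
```

Calling `select_features` on a table in which one metric is a shifted copy of another keeps only one of the two, since the copy is fully redundant with the feature ranked before it.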
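
The pipeline in contribution (2), normalization followed by oversampling and tri-training, can be sketched as follows. This is a simplified illustration under stated assumptions, not the model evaluated in this paper: it uses min-max normalization, random oversampling, a nearest-centroid base learner, and a stratified bootstrap, and it omits the error-rate conditions that the full tri-training algorithm applies before accepting pseudo-labels. All function and class names are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] (constant columns map to 0.0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def oversample(rows, labels, rng):
    """Random oversampling: duplicate minority-class rows until classes balance."""
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows += members + extra
        out_labels += [y] * target
    return out_rows, out_labels


def bootstrap(rows, labels, rng):
    """Stratified bootstrap: resample with replacement inside each class."""
    out_rows, out_labels = [], []
    for y in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == y]
        out_rows += [rng.choice(members) for _ in members]
        out_labels += [y] * len(members)
    return out_rows, out_labels


class NearestCentroid:
    """Toy base learner (an assumption): predicts the class with the closest mean."""

    def fit(self, rows, labels):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(y, []).append(row)
        self.centroids = {
            y: [sum(col) / len(col) for col in zip(*members)]
            for y, members in groups.items()
        }
        return self

    def predict(self, row):
        def dist(y):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[y]))
        return min(self.centroids, key=dist)


def tri_train(rows, labels, unlabeled, rounds=3, seed=0):
    """Simplified tri-training; returns a majority-vote predict function."""
    rng = random.Random(seed)
    rows, labels = oversample(rows, labels, rng)
    # Three initial classifiers, each trained on its own bootstrap sample.
    models = [NearestCentroid().fit(*bootstrap(rows, labels, rng)) for _ in range(3)]
    for _ in range(rounds):
        updated = []
        for i in range(3):
            a, b = (models[j] for j in range(3) if j != i)
            # Pseudo-label the unlabeled rows on which the other two agree.
            guesses = [(u, a.predict(u), b.predict(u)) for u in unlabeled]
            agreed = [(u, ya) for u, ya, yb in guesses if ya == yb]
            updated.append(NearestCentroid().fit(
                rows + [u for u, _ in agreed],
                labels + [y for _, y in agreed],
            ))
        models = updated

    def predict(row):
        votes = [m.predict(row) for m in models]
        return max(set(votes), key=votes.count)

    return predict
```

In this sketch the labeled, unlabeled, and query rows should be normalized together so that all of them share the same feature scale before `tri_train` is called.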
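
The line-level ranking in contribution (3) can be illustrated with a small sketch. It assumes a bigram model (N = 2) with add-one smoothing and a simple regular-expression tokenizer; the N-gram order, smoothing scheme, and tokenizer used in this paper may differ, and the names `tokenize`, `BigramModel`, and `rank_lines` are hypothetical. The cross entropy of a line with tokens t_1..t_n is computed as the negative mean of log2 P(t_i | t_{i-1}).

```python
import math
import re
from collections import Counter

# Identifiers, integer literals, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(line):
    """Split a code line into tokens, prefixed with a start-of-line marker."""
    return ["<s>"] + TOKEN.findall(line)


class BigramModel:
    """Bigram language model over code tokens with add-one smoothing."""

    def __init__(self, corpus_lines):
        self.contexts = Counter()
        self.bigrams = Counter()
        vocab = set()
        for line in corpus_lines:
            tokens = tokenize(line)
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot stands in for any token never seen in the corpus.
        self.vocab_size = len(vocab) + 1

    def prob(self, prev, tok):
        count = self.bigrams[(prev, tok)]
        return (count + 1) / (self.contexts[prev] + self.vocab_size)

    def cross_entropy(self, line):
        """Average negative log2 probability per token of the line."""
        tokens = tokenize(line)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        return -sum(math.log2(self.prob(p, t)) for p, t in pairs) / len(pairs)


def rank_lines(model, lines):
    """Return (cross_entropy, line_number, text) tuples, least natural first."""
    scored = [(model.cross_entropy(text), no, text)
              for no, text in enumerate(lines, start=1)]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Lines at the head of the list returned by `rank_lines` are the least "natural" with respect to the corpus and would be inspected first within a module that the prediction step has flagged as defective.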