Research On Software Defect Prediction Technology For Few-sample Data

Posted on: 2022-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: H W Zhang
GTID: 2518306755451414
Subject: Computer technology
Abstract/Summary:
The scale and complexity of software products keep growing with the development of Internet technology, which makes it harder to deliver high-quality, low-cost, easy-to-maintain software and increases the probability of defects. Before a software product is released, a model built with software defect prediction technology can identify the modules that are prone to defects, so that the company can allocate its limited testing and maintenance resources sensibly, reducing costs and improving software quality. Building a stable and efficient defect prediction model usually requires a large, class-balanced defect data set. In practice, however, software defect data sets face two problems: the historical defect data set contains too few defective samples (the class imbalance problem), and newly started projects can rarely obtain historical defect data with the same distribution. As a result, the trained defect prediction model often fails to achieve the expected results. This thesis studies software defect prediction from the perspective of few-sample data. The main contents are as follows.

First, an over-sampling method based on a generative adversarial network (GAN) for tackling class imbalance in software defect prediction. Defective modules are far fewer than non-defective modules, so the distribution of the two classes is unbalanced. Many techniques have been proposed to address class imbalance, and over-sampling is one of the most representative. However, most over-sampling methods generate many synthetic samples that lack diversity, and building a model on such samples reduces its predictive performance. A GAN can make full use of the spatial relationships in the sample distribution, uncover hidden correlations between samples, and produce more diverse new samples through the alternating optimization of the generator and the discriminator. In view of this, a novel GAN-based over-sampling method is proposed to address the shortage of defective samples in historical defect data sets. Experiments on 26 unbalanced defect data sets show that this method outperforms existing over-sampling methods.

Second, a heterogeneous defect prediction method based on simultaneous semantic alignment. Historical defect data sets are difficult to obtain for newly started projects, and collecting training data relies on expert knowledge, which is time-consuming, laborious, and error-prone. We therefore explore using the heterogeneous feature data of source projects to predict the defect proneness of software modules in target projects. Most current heterogeneous defect prediction methods handle heterogeneity by learning a domain-invariant feature subspace that reduces the differences between domains. However, the source domain and the target domain usually differ greatly, so domain alignment performs poorly. The reason is that these methods ignore the prior knowledge that a classifier should produce similar classification probability distributions for the same category in both domains, and they fail to mine the intrinsic semantic information contained in the data. In view of this, a heterogeneous defect prediction method based on simultaneous semantic alignment is proposed. By approaching the problem from the perspective of heterogeneity, it addresses the difficulty of obtaining historical defect data with the same distribution for newly started projects, and the model trained on the mined semantic information remains stable and efficient. Experiments on public heterogeneous data sets from 30 different projects verify the effectiveness of this method.

Third, a software defect prediction system based on few-sample data. A simple software defect prediction system is designed around the two methods proposed above. With this system, users can directly observe how few defective samples a data set contains and generate synthetic samples to rebalance it. Users can also choose heterogeneous data sets themselves, set the corresponding parameters, and mine the semantic information in the data to perform heterogeneous defect prediction. The system shows the concrete form of the common feature subspace learned during simultaneous semantic alignment; if the learned subspace is not good enough, users can adjust the parameters and retrain at any time. The system makes the strengths and weaknesses of the proposed methods visible, further verifying their effectiveness.
Keywords/Search Tags: software defect prediction, few sample data, heterogeneous defect prediction, class imbalance, domain alignment
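
The class imbalance problem and the conventional over-sampling approach that the abstract contrasts with its GAN-based method can be illustrated with a small sketch. This is not the thesis's method: it is a simplified SMOTE-style interpolation baseline, and the function name `interpolation_oversample`, its parameters, and the toy feature values are all hypothetical, chosen only for illustration.

```python
import random


def interpolation_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic sample is a random point on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (a simplified SMOTE-style baseline, not the GAN method).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest neighbours are searched among the other minority samples.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): 3 defective modules versus 9 non-defective modules,
# each described by two illustrative metrics.
defective = [[10.0, 3.0], [12.0, 4.0], [11.0, 5.0]]
clean = [[float(i), float(i % 3)] for i in range(9)]

new = interpolation_oversample(defective, len(clean) - len(defective))
balanced_defective = defective + new
print(len(balanced_defective), len(clean))
```

Every synthetic point here lies on a segment between two existing defective samples, which is one way to see why interpolation-based over-sampling can produce samples with limited diversity, the limitation the abstract attributes to most existing over-sampling methods.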