
Research On Data Preprocessing Technologies For Software Defect Prediction

Posted on: 2015-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Chen
Full Text: PDF
GTID: 2308330485490673
Subject: Computer technology
Abstract/Summary:
With the expansion of computer software applications, the size and complexity of software projects are growing steadily, and software quality is attracting more attention. Software defect prediction (SDP), an important technology for ensuring software quality and improving software reliability, has been widely used in software engineering practice. Timely and accurate prediction of defect-prone software entities helps allocate limited testing resources effectively. How to predict software defects quickly and precisely has become a consistent focus of both the academic and industrial communities. Data preprocessing can improve the quality of SDP datasets and thereby the efficiency and performance of prediction, which makes it a key step in software defect prediction.

The data preprocessing techniques used in current SDP research have several problems in practical application. First, the most commonly used feature ranking techniques for SDP choose the most relevant features only according to their importance in differentiating instances of different classes, without considering the redundancy among the relevant features. Second, attention to instance quality in current SDP research is limited to the imbalanced distribution between classes. Few studies have focused on other problems in instance datasets, such as high redundancy and noise. Finally, while a great deal of work has been done on feature selection and class imbalance in SDP separately, little research has been reported on combining different techniques to improve data quality comprehensively.

This thesis proposes a feature selection framework in which relevance analysis is followed by redundancy analysis. Within this framework, a novel feature selection method combining ranking and threshold-based clustering (RTC) is proposed to select the most relevant features while eliminating redundant ones. In addition, a two-phase data preprocessing method is introduced, in which feature selection is followed by instance selection. This method addresses three data quality problems together: high feature dimensionality, class imbalance, and high instance redundancy. The comprehensive empirical studies in this thesis show that the proposed RTC feature selection method achieves a significant reduction in features. With only 11.8% of the original feature set retained on average, prediction performance improves to some extent overall. With the two-phase data preprocessing method, both the original feature set and the instance set are reduced significantly, and the instance set is reduced by 63.3%. Compared with feature selection alone, two-phase data preprocessing further improves overall prediction performance.

The main contributions of this thesis can be summarized as follows:

1. Topics in software defect prediction and data preprocessing technologies are reviewed. First, the background and technical framework of software defect prediction are introduced. Then, the field is surveyed along two main topics: software metrics and classification algorithms. Finally, data preprocessing technology, the subject of this thesis, is discussed from three aspects: data quality problems, related technologies, and research progress. Three data preprocessing technologies are categorized: feature selection, class imbalance learning, and instance selection.

2. A novel feature selection method, RTC, is proposed. RTC combines relevance analysis and redundancy analysis to select the most relevant features while eliminating redundant ones. First, the motivation for RTC is introduced. Then, the framework of RTC is described. Finally, the implementation details of its two core stages, feature ranking and threshold-based clustering, are presented.

3. A two-phase data preprocessing approach is proposed. It combines feature selection and instance selection to solve the problems caused by high dimensionality, class imbalance, and high redundancy. First, the motivation for the approach is introduced. Then, the framework is described. Finally, the implementation details of its two core stages, feature selection and instance selection, are presented.

4. Comprehensive empirical studies of RTC and of two-phase data preprocessing are conducted. To assess the effectiveness of RTC, three research questions with two evaluation metrics are designed. RTC is then implemented and applied to 4 experimental subjects and compared with other commonly used feature selection methods. Finally, the three research questions are answered in turn. The results show that RTC not only greatly decreases the size of the original feature set, but also maintains, and sometimes even improves, defect prediction performance. To assess the effectiveness of the two-phase data preprocessing approach, three research questions with three evaluation metrics are designed. The approach is then implemented and applied to 10 experimental subjects and compared with no preprocessing and with feature selection only. Finally, the three research questions are answered in turn. The results show that two-phase data preprocessing not only greatly decreases the size of both the original feature set and the instance set, but also maintains, and sometimes even further improves, defect prediction performance.
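
The general idea of relevance analysis followed by redundancy analysis can be illustrated with a minimal sketch. The abstract does not specify the ranking measure, the clustering rule, or the threshold used by RTC, so everything below is an assumption made for illustration: relevance is taken as the absolute Pearson correlation between a feature and the defect label, redundancy as the absolute Pearson correlation between two features, and the names `pearson`, `select_features`, `threshold`, and the toy metric values are hypothetical. It is not the thesis's implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(features, labels, threshold=0.9):
    """Rank features by relevance, then group redundant ones by a threshold.

    features:  dict mapping feature name -> list of values (one per instance)
    labels:    list of 0/1 defect labels (one per instance)
    threshold: assumed cut-off; two features whose absolute correlation
               reaches it are treated as redundant
    Returns (selected feature names, clusters keyed by representative).
    """
    # Stage 1 (relevance): most label-correlated features come first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], labels)),
        reverse=True,
    )

    # Stage 2 (redundancy): walk the ranking; a feature joins the first
    # cluster whose representative it is redundant with, otherwise it
    # starts a new cluster and becomes that cluster's representative.
    clusters = {}
    for name in ranked:
        for representative in clusters:
            if abs(pearson(features[name], features[representative])) >= threshold:
                clusters[representative].append(name)
                break
        else:
            clusters[name] = [name]

    # Keep one representative (the most relevant member) per cluster.
    return list(clusters), clusters


# Toy data: "statements" is nearly a rescaled copy of "loc".
metrics = {
    "loc": [10, 200, 35, 400, 50, 320],
    "statements": [8, 180, 30, 390, 45, 300],
    "comment_ratio": [0.3, 0.1, 0.4, 0.2, 0.1, 0.3],
}
defective = [0, 1, 0, 1, 0, 1]

selected, clusters = select_features(metrics, defective)
print(selected)
```

In this toy run the two size metrics fall into one cluster and only one of them is kept, while the weakly correlated comment ratio forms its own cluster. Keeping the top-ranked member of each cluster is one simple way to retain relevant features without retaining their near-duplicates.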
Keywords/Search Tags: Software Defect Prediction, Data Preprocessing, Feature Selection, Class Imbalance, Instance Selection