
Research on Imbalanced Classification and Feature Selection in Clone Code Harmfulness Prediction

Posted on: 2018-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2348330512496458
Subject: Software engineering
Abstract/Summary:
In large-scale software development and maintenance, developers frequently reuse source code, which produces a large amount of clone code. Whether cloning is done actively or passively, it has a two-sided impact on software development and maintenance. Because clone code is so widespread and influential, it has become an active research topic in software engineering. Predicting harmful clone code helps developers and maintainers understand how defects are distributed, allocate limited development resources more rationally, and test software more efficiently. Two issues remain important: how to handle imbalanced classification data, and how to select appropriate metrics to characterize clone code.

Machine learning is widely used to predict harmful clone code, and class imbalance in the data seriously degrades prediction. Clone code harmfulness prediction is a typical classification problem in which clones are divided into harmful and harmless categories. To address the imbalance between harmful and harmless data and improve harmfulness prediction, this thesis proposes an improved algorithm based on Random Under-Sampling (RUS). In addition, researchers have presented a variety of feature metrics from different perspectives, which are also used to predict harmful clone code automatically, but the relationships among these features have not been analyzed comprehensively. To deal with irrelevant and redundant features, the thesis therefore proposes a combined model that selects harmfulness features of clone code according to their relevance and influence. The effectiveness and practicality of the proposed imbalance solution and feature selection model are validated through the following studies.

First, for the imbalance between harmful and harmless data, an algorithm based on Random Under-Sampling is proposed that adjusts the class imbalance automatically. A sample data set is constructed by extracting static and evolution features of clone code. New data sets with different class proportions are then selected from it, and harmfulness prediction is carried out on each of them. Finally, the most suitable imbalance proportion is chosen automatically by comparing the performance of the classifier across these data sets. The experimental results show that the proposed method improves prediction effectively.

Second, for irrelevant and redundant features in harmfulness prediction, a combined feature selection model based on relevance and influence is proposed. The features are first ranked by information gain ratio; highly relevant features are kept and irrelevant ones are removed to reduce the feature search space. The optimal feature subset is then determined with the wrapper-based sequential floating forward selection algorithm, combined with six classifiers including Naive Bayes. Finally, different feature selection methods are compared and analyzed; the experimental results show that the accuracy of the harmfulness prediction model can be greatly improved.

The two solutions proposed in this thesis for class imbalance and feature selection, together with the experimental results and analysis, provide objective feature data to support research on harmfulness prediction, and offer support and reference for the maintenance management and software quality assessment of clone code.
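The under-sampling procedure described in the abstract (build candidate data sets at several class proportions, run prediction on each, keep the proportion that performs best) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: the nearest-centroid classifier, the F1 score on the harmful class, the candidate ratios, and the synthetic two-feature data are all stand-ins chosen to keep the example self-contained.

```python
import random


def make_data(n_harmless, n_harmful, rng):
    """Synthetic stand-in for clone features: (feature vector, label), 1 = harmful."""
    data = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 0) for _ in range(n_harmless)]
    data += [([rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)], 1) for _ in range(n_harmful)]
    return data


def undersample(samples, ratio, rng):
    """Randomly drop harmless samples until harmless:harmful is about ratio:1."""
    harmful = [s for s in samples if s[1] == 1]
    harmless = [s for s in samples if s[1] == 0]
    keep = min(len(harmless), int(round(ratio * len(harmful))))
    return harmful + rng.sample(harmless, keep)


def train_centroids(samples):
    """Toy classifier (assumption): mean feature vector per class.

    Expects at least one sample of each class.
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def f1_harmful(centroids, samples):
    """F1 score for the harmful class on a held-out set."""
    tp = fp = fn = 0
    for x, y in samples:
        p = predict(centroids, x)
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 1:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def choose_ratio(train, valid, ratios, seed=0):
    """Try each candidate ratio and return the best one with all scores."""
    rng = random.Random(seed)
    scores = {}
    for ratio in ratios:
        model = train_centroids(undersample(train, ratio, rng))
        scores[ratio] = f1_harmful(model, valid)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(42)
    train = make_data(500, 50, rng)   # 10:1 imbalance in the training data
    valid = make_data(200, 20, rng)   # validation set keeps the original imbalance
    best, scores = choose_ratio(train, valid, [1.0, 2.0, 3.0, 5.0])
    print("best harmless:harmful ratio:", best)
    print("scores per ratio:", scores)
```

Only the training data is under-sampled; the validation set keeps the original class distribution, so the score reflects how the classifier would behave on realistically imbalanced data.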
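The first stage of the feature selection model ranks features by information gain ratio before any wrapper search runs. A minimal sketch of that filter step for discrete features is shown below. The feature names and toy values are hypothetical, and the subsequent sequential floating forward selection stage is not included.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    conditional = 0.0
    for value, n in Counter(feature).items():
        subset = [y for f, y in zip(feature, labels) if f == value]
        conditional += (n / total) * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels, keep):
    """Return the names of the `keep` features with the highest gain ratio."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels), reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical discretized clone features; 1 = harmful, 0 = harmless.
    labels = [1, 1, 0, 0]
    columns = {
        "change_frequency": ["high", "high", "low", "low"],  # tracks the label exactly
        "file_type": ["x", "y", "x", "y"],                   # unrelated to the label
        "language": ["c", "c", "c", "c"],                    # constant
    }
    for name in columns:
        print(name, gain_ratio(columns[name], labels))
    print("kept:", rank_features(columns, labels, keep=1))
```

Continuous metrics would need to be discretized first (or handled with a threshold-based split) before this calculation applies.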
Keywords/Search Tags:clone code, imbalanced classification, feature selection, inconsistent changes, harmfulness prediction