
Research On Data Drought Key Techniques For Software Effort Data Based On Machine Learning

Posted on: 2019-08-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F M Qi
GTID: 1368330545999871
Subject: Computer application technology
Abstract/Summary:
Software effort estimation (SEE) is a key step in developing a software project and has attracted the attention of many researchers. Although many studies address the problems that arise in SEE and existing methods have obtained interesting results, many practical problems remain to be solved. One of the biggest obstacles in SEE research is the data drought issue. This thesis proposes a plan to alleviate the data drought issue from several aspects and obtains the following research results:

(1) Effort data with missing values usually still contains a lot of useful information that can help train an estimator. To make full use of such data, this thesis proposes a new imputation method based on low-rank recovery and semi-supervised regression. It first divides missing data into three scenarios: missing values occur only in the independent variables, missing values occur only in the dependent variables, and missing values occur in both. Low-rank recovery and semi-supervised regression techniques are then introduced to impute the missing values in each scenario. In addition, so that the introduced methods can use the effort data effectively, the thesis designs a data structurization strategy that transforms unstructured data into structured data with class labels. Experiments on seven different datasets demonstrate that the proposed method performs better than traditional methods.

(2) Sharing data is one of the main ways to relieve the data drought issue, yet privacy disclosure has become the main obstacle to data sharing. From the perspective of privacy preservation, this thesis therefore proposes the Interval Covering Based Subclass Division and Manifold Learning Based Bi-directional Obfuscation (ICSD&MLBDO) method. It designs a subclass division method based on interval covering theory to create 'class labels' for the effort data, and then introduces ideas from classical privacy-preserving methods to protect the privacy of the SEE data. In addition, it designs a new bi-directional obfuscation method that further enhances the privacy of the obfuscated data while preserving its utility. Experiments on seven different datasets show that the proposed approach can protect data privacy during data sharing.

(3) Missing data imputation and data sharing are passive methods of relieving the data drought issue. This thesis therefore also proposes an active method based on open-source projects (OSPs). It extracts effort data from open-source projects from three aspects: filtering the OSPs, designing the cost metrics, and incrementing the effort data online. As part of this procedure, it proposes an effort data incrementation method based on AdaBoost, namely the AdaBoost based estimator with CART (ABCART). ABCART modifies AdaBoost according to the characteristics of effort data, making it better suited to the case where the effort data keeps increasing.
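
The three missing-data scenarios described in part (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the use of `None` to mark a missing value, the per-record grouping, and the toy records are all assumptions made for illustration, and the imputation step itself (low-rank recovery or semi-supervised regression) is not shown.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A record is (independent variables, dependent variable); None marks a missing value.
# This representation is an assumption for illustration, not taken from the thesis.
Record = Tuple[Sequence[Optional[float]], Optional[float]]


def missing_scenario(features: Sequence[Optional[float]], effort: Optional[float]) -> str:
    """Classify one record by where its missing values occur."""
    x_missing = any(value is None for value in features)
    y_missing = effort is None
    if x_missing and y_missing:
        return "both"
    if x_missing:
        return "independent_only"
    if y_missing:
        return "dependent_only"
    return "complete"


def split_by_scenario(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records so each scenario can be passed to its own imputation routine."""
    groups: Dict[str, List[Record]] = {
        "complete": [],
        "independent_only": [],
        "dependent_only": [],
        "both": [],
    }
    for features, effort in records:
        groups[missing_scenario(features, effort)].append((features, effort))
    return groups


# Hypothetical toy records: two project features and an effort value.
records: List[Record] = [
    ([10.0, 3.0], 120.0),   # nothing missing
    ([None, 4.0], 200.0),   # missing only in the independent variables
    ([8.0, 2.0], None),     # missing only in the dependent variable
    ([None, None], None),   # missing in both
]
groups = split_by_scenario(records)
```

Grouping records this way only shows how the scenarios differ; which technique the thesis applies to each scenario is not specified in the abstract.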
Keywords/Search Tags:Machine Learning, Software Engineering, Software Effort Estimation, Data Drought Techniques, Open-source SEE Data