Font Size: a A A

Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning

Posted on:2012-08-12Degree:Ph.DType:Dissertation
University:University of Alberta (Canada)Candidate:Pelayo Ramirez, LourdesFull Text:PDF
GTID:1468390011460343Subject:Engineering
Abstract/Summary:
The imbalanced learning problem occurs in a large number of economic and health domains of great importance; consequently, it has drawn a significant amount of interest from academia, industry, and government funding agencies. Several researchers have used stratification to alleviate this problem; however, it is not clear what stratification strategy is in general more effective: under-sampling, over-sampling or the combination of both. Our first topic evaluates the contribution of stratification strategies in the software defect prediction area. We study the statistical contribution of stratification in the new Mozilla dataset, a new large-scale software defect prediction dataset which includes both object-oriented metrics and a count of defects per module. Our second topic responds to the debate about the contribution of over-sampling, under-sampling and the combination of both with the employment of a full-factorial design experiment using the Analysis of Variance (ANOVA) over six software defect prediction datasets. We extend our research to develop a stratification method to mitigate sample selection bias in function approximation problems. The sample selection bias is present when the training and test instances are drawn from a different distribution, with the imbalance dataset problem considered a particular case of sample selection bias. We extend the well-known SMOTE over-sampling technique to continuous-valued response variables. Our new algorithm proves to be a valuable algorithm helping to increase the performance on function approximation problems and effectively reducing the impact of sample selection bias.
Keywords/Search Tags:Sample selection bias, Problem, Software defect prediction
Related items