A Segmentation and Re-balancing Approach for Classification of Imbalanced Data

Posted on: 2012-07-01
Degree: Ph.D.
Type: Dissertation
University: University of Cincinnati
Candidate: Gong, Rongsheng
Full Text: PDF
GTID: 1458390008992220
Subject: Engineering
Abstract/Summary:
Classification is one of the important tasks of data mining. Class imbalance -- or differences in class distribution -- has been reported to hinder the performance of standard classification models. This dissertation first presents a systematic study of the impact of class imbalance on several critical steps of learning, namely feature selection, model fitting, and performance evaluation. The study also shows, however, that class imbalance may not be the only cause of the loss of performance, and that the underlying complexity of the problem may play a more fundamental role. The dissertation therefore proposes the K-S tree, a decision tree method based on the Kolmogorov-Smirnov statistic, to segment the data so that a complex problem is dissected into easier sub-problems, within each of which class imbalance becomes less challenging. The K-S tree is also used for feature selection: it not only selects relevant variables but also removes redundant ones. After segmentation, two-way re-sampling is performed at the segment level, and the rebalanced data are used to fit logistic regression models, also at the segment level. The effectiveness of the proposed method is demonstrated through three case studies: automatic detection of microcalcifications in mammograms, San Diego housing refinance prediction, and credit risk assessment.
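The abstract does not spell out how the K-S tree chooses its splits or how the two-way re-sampling is carried out, so the following is only a minimal sketch under stated assumptions, not the dissertation's implementation. It assumes binary labels (0 and 1), numeric features, a split criterion that picks the feature and threshold maximizing the two-sample Kolmogorov-Smirnov distance between the class-conditional empirical CDFs, and a "two-way" re-sampling read as undersampling the larger class while oversampling the smaller one. The function names `ks_statistic`, `best_ks_split`, and `two_way_resample` and the parameter `target_per_class` are hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(pos, neg):
    """Return (D, threshold): the largest gap between the empirical CDFs
    of the two classes and the value at which it occurs.
    Assumes both lists are non-empty."""
    pos_sorted, neg_sorted = sorted(pos), sorted(neg)
    best_d, best_t = 0.0, None
    # The gap can only change at observed values, so those are the candidates.
    for t in sorted(set(pos_sorted) | set(neg_sorted)):
        cdf_pos = bisect_right(pos_sorted, t) / len(pos_sorted)
        cdf_neg = bisect_right(neg_sorted, t) / len(neg_sorted)
        d = abs(cdf_pos - cdf_neg)
        if d > best_d:
            best_d, best_t = d, t
    return best_d, best_t


def best_ks_split(rows, labels):
    """Return (D, feature index, threshold) for the feature whose
    class-conditional distributions are furthest apart."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        d, t = ks_statistic(pos, neg)
        if d > best[0]:
            best = (d, j, t)
    return best


def two_way_resample(rows, labels, target_per_class, seed=0):
    """Assumed reading of 'two-way' re-sampling: shrink the larger class by
    sampling without replacement and grow the smaller class by sampling with
    replacement, so each class ends with target_per_class rows.
    Assumes both classes are present."""
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        if len(members) >= target_per_class:
            chosen = rng.sample(members, target_per_class)
        else:
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([cls] * target_per_class)
    return out_rows, out_labels
```

In a segmentation pipeline of the kind described, a split such as the one returned by `best_ks_split` would be applied recursively to form segments, and a function like `two_way_resample` would be called once per segment before fitting that segment's logistic regression model. The stopping rule and the per-segment target class sizes are not given in the abstract and would need to be chosen separately.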
Keywords/Search Tags:Class, Imbalance, Data, Segment