A Segmentation and Re-balancing Approach for Classification of Imbalanced Data

Posted on: 2012-07-01

Degree: Ph.D.

Type: Dissertation

University: University of Cincinnati

Candidate: Gong, Rongsheng

GTID: 1458390008992220

Subject: Engineering

Abstract/Summary:

Classification is one of the central tasks of data mining. Class imbalance, meaning an unequal class distribution, has been reported to hinder the performance of standard classification models. This dissertation first presents a systematic study of the impact of class imbalance on several critical steps of learning: feature selection, model fitting, and performance evaluation. The study also shows, however, that class imbalance may not be the only cause of performance loss, and that the underlying complexity of the problem may play a more fundamental role. To address this, the dissertation proposes the K-S tree, a decision tree method based on the Kolmogorov-Smirnov statistic, which segments the data so that a complex problem is dissected into easier sub-problems, within each of which class imbalance becomes less challenging. The K-S tree is also used for feature selection, selecting relevant variables while removing redundant ones. After segmentation, two-way re-sampling is performed at the segment level, and the rebalanced data are used to fit logistic regression models, also at the segment level. The effectiveness of the proposed method is demonstrated through three case studies: automatic detection of microcalcifications in mammograms, San Diego housing refinance prediction, and credit risk assessment.
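The abstract does not give the algorithmic details, so the following is only a minimal sketch of two of the ideas it names, under stated assumptions. `ks_split` picks a cut point on one numeric feature by maximizing the two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between the empirical CDFs of the two classes. `two_way_resample` shows one possible reading of "two-way re-sampling" within a segment: undersample the majority class and oversample the minority class toward a common size. All function names, the binary 0/1 label encoding, and the choice of target size are hypothetical and are not taken from the dissertation.

```python
import random
from bisect import bisect_right


def ks_split(values, labels):
    """Return (ks_statistic, threshold) for one numeric feature.

    Assumes binary labels encoded as 0/1. The threshold is the observed
    value at which the class-conditional empirical CDFs differ the most.
    """
    pos = sorted(v for v, y in zip(values, labels) if y == 1)
    neg = sorted(v for v, y in zip(values, labels) if y == 0)
    if not pos or not neg:
        return 0.0, None  # only one class present: nothing to separate
    best_stat, best_cut = 0.0, None
    for cut in sorted(set(values)):
        cdf_pos = bisect_right(pos, cut) / len(pos)  # share of class 1 <= cut
        cdf_neg = bisect_right(neg, cut) / len(neg)  # share of class 0 <= cut
        gap = abs(cdf_pos - cdf_neg)
        if gap > best_stat:
            best_stat, best_cut = gap, cut
    return best_stat, best_cut


def best_feature_split(rows, labels):
    """Return (ks_statistic, feature_index, threshold) of the best feature."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        stat, cut = ks_split([r[j] for r in rows], labels)
        if stat > best[0]:
            best = (stat, j, cut)
    return best


def two_way_resample(rows, labels, seed=0):
    """Assumed reading of two-way re-sampling: bring both classes to the
    same size (half the segment size) by undersampling the larger class
    and oversampling the smaller one with replacement."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return list(rows), list(labels)  # cannot rebalance a single class
    target = (len(pos) + len(neg)) // 2

    def draw(group):
        if len(group) >= target:
            return rng.sample(group, target)  # undersample, no replacement
        return group + rng.choices(group, k=target - len(group))  # oversample

    new_pos, new_neg = draw(pos), draw(neg)
    return new_pos + new_neg, [1] * len(new_pos) + [0] * len(new_neg)
```

In a full pipeline along the lines the abstract describes, a tree would apply `best_feature_split` recursively to form segments, and each rebalanced segment would then be used to fit its own logistic regression model; both steps are omitted here.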

Keywords/Search Tags:

Class, Imbalance, Data, Segment

Related items

1	Research On Data Imbalance In Visual Tracking
2	Research On Multi-class Imbalance Learning
3	The Research Of Class Imbalance Classification Model In Data Mining
4	Research On Contrast Pattern-based Classification For Imbalanced Data
5	Research On The Application Of Generative Adversarial Networks In Class Imbalance
6	A balanced approach to the multi-class imbalance problem
7	Based On Ensemble Sampling And Data Imbalance Self-adaptive Processing Method In Defect Prediction Context
8	Research On Key Techniques For Class Imbalanced Data Classification
9	Research On Class Imbalance Based On Spark
10	Improvement Of Preprocessing Technology And Algorithm On Multi-class Imbalanced Data Set