Classifier design to improve pattern classification and knowledge discovery for imbalanced datasets

Posted on:2010-07-25

Degree:Ph.D

Type:Dissertation

University:The University of North Carolina at Chapel Hill

Candidate:Wang, Kun

Full Text:PDF

GTID:1448390002474949

Subject:Chemistry

Abstract/Summary:

Mining imbalanced datasets is a nontrivial problem with applications in many fields, including scientific research, medical diagnosis, business, and industry. Standard machine learning algorithms fail to produce satisfactory classifiers: they tend to over-fit the larger class and ignore the smaller class.

Numerous algorithms have been developed to handle class imbalance, and limited progress has been achieved in improving prediction accuracy for the smaller class. However, real-world datasets may have hidden detrimental characteristics other than class imbalance. Those characteristics are usually dataset-specific and can defeat algorithms that are robust on other imbalanced datasets. Mining such datasets can only be improved by algorithms tailored to domain characteristics (Weiss, 2004); therefore, exploratory data analysis is a necessary step before classifier design. In addition, unmet needs in knowledge discovery, such as lead optimization during drug discovery, demand novel algorithms.

In this study, we developed a framework for imbalanced dataset mining that is tailored to data characteristics and adapted to knowledge discovery in chemical datasets. First, we explored the dataset and visualized domain characteristics; we then designed classifiers accordingly. For class imbalance, active learning (AL), cost-sensitive learning (CSL), and re-sampling methods were designed; for class overlap, Class Boundary Cleaning (CBC) and Class Boundary Mining (CBM) were developed. CBM was also designed for lead optimization: ideally, it would detect fine structural differences between classes of compounds, and these differences could suggest options for lead optimization.

The methods developed were applied to two datasets, hERG and CPDB. Results from the imbalanced hERG liability dataset showed that CBC, CBM, and AL were effective in correcting class imbalance and overlap and in improving classifier performance.
Highly predictive models were built, discriminating patterns were discovered, and lead optimization options were proposed. The methodology developed and the knowledge discovered will benefit drug discovery and improve hazard test prioritization, risk assessment, and governmental regulatory work on human health and environmental protection.

Keywords: QSAR, applicability domain, outliers, data mining, data visualization, class imbalance, class overlap, sampling, cost-sensitive learning, class boundary cleaning, class boundary mining, active learning.
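
As a rough illustration of two generic strategies named in the abstract, re-sampling and cost-sensitive learning, the sketch below shows random oversampling of the smaller class and inverse-frequency class weights on a toy label set. It is not the dissertation's implementation: the function names `balanced_class_weights` and `random_oversample`, the toy data, and the weighting formula n_samples / (n_classes * class_count) are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rarer classes get larger weights, so a cost-sensitive learner that
    multiplies each sample's loss by its class weight pays more for
    errors on the smaller class. (Illustrative formula, an assumption.)
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 "inactive" compounds and 2 "active" ones.
labels = ["inactive"] * 8 + ["active"] * 2
rows = [[i] for i in range(10)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)
print(Counter(new_labels))
```

Oversampling changes the training set itself, while class weights leave the data untouched and change only the cost of each error; which is preferable depends on the dataset, which is consistent with the abstract's emphasis on exploring the data before designing a classifier.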

Keywords/Search Tags:

Class, Data, Imbalance, Knowledge discovery, Mining, Lead optimization

Related items

1	The Research Of Class Imbalance Classification Model In Data Mining
2	Research On Multi-class Imbalance Learning
3	Knowledge Discovery from Databases: Cost-sensitive and imbalance learning
4	Based On Cluster Analysis Of The Data Mining Algorithm
5	Research On Contrast Pattern-based Classification For Imbalanced Data
6	Object Detection With Class Imbalance Based On Knowledge Distillation And Data Augmentation Methods
7	Research On Several Key Issues In Unsupervised Knowledge Discovery
8	Research For Aviation Data Mining And Knowledge Discovery Based On ASDI
9	Based On Knowledge Discovery Mechanism Of Enterprise Decision Support Systems Research
10	Research On Data Imbalance In Visual Tracking