Approaches to Feature Identification and Feature Selection for Binary and Multi-Class Classification

Posted on:2008-07-06

Degree:Ph.D

Type:Dissertation

University:University of Minnesota

Candidate:Zhang, Zisheng


GTID:1448390005959644

Subject:Engineering

Abstract/Summary:

In this dissertation, we address (a) feature identification and extraction, and (b) feature selection. Datasets keep growing, driven in particular by internet data and bioinformatics, so applying feature extraction and selection to reduce the dimensionality of the data is crucial to data mining.

Our first objective is to identify discriminative patterns in time-series datasets. Using autoregressive modeling, we show that, if two bands are selected appropriately, the ratio of band powers is amplified for one of the two states. We introduce a novel frequency-domain power ratio (FDPR) test to determine how these two bands should be selected. The FDPR computes the ratio of two model filter transfer functions, where the model filters are estimated from different parts of the time series that correspond to two different states. The ratio implicitly cancels the effect of a change in the variance of the white noise that is input to the model. Thus, even in a highly non-stationary environment, the ratio feature is able to correctly identify a change of state. Synthesized data and application examples from seizure prediction are used to demonstrate the validity of the proposed approach. We also illustrate that combining the spectral power ratio features with absolute and relative spectral powers into one feature set, and then carefully selecting a small number of features from a few electrodes, can achieve good detection and prediction performance on short-term datasets and long-term fragmented datasets collected from subjects with epilepsy.

Our second objective is to develop efficient feature selection methods for binary classification (MUSE) and multi-class classification (M3U) that effectively select important features to achieve good classification performance.
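As a rough illustration of the band-selection idea behind the first objective, the sketch below compares the squared magnitude responses of two autoregressive model filters, one per state, and reports the frequencies where their ratio is largest and smallest. It is not the dissertation's FDPR test: the function names, the `resonator` helper, and the AR(2) coefficients are hypothetical, and the coefficients are assumed to have been estimated already.

```python
import cmath
import math


def ar_power_response(ar_coeffs, freq):
    """Return |H(f)|^2 for the all-pole filter H(z) = 1 / (1 - sum_k a_k z^-k).

    freq is normalized to cycles per sample (0 to 0.5).
    """
    denom = 1.0 - sum(
        a * cmath.exp(-2j * math.pi * freq * (k + 1))
        for k, a in enumerate(ar_coeffs)
    )
    return 1.0 / abs(denom) ** 2


def power_ratio_curve(ar_state_a, ar_state_b, n_freqs=256):
    """Ratio |H_a(f)|^2 / |H_b(f)|^2 on a uniform frequency grid.

    A model's power spectrum is sigma^2 * |H(f)|^2, so taking the ratio of
    the transfer functions alone leaves the noise variances out.
    """
    freqs = [0.5 * i / (n_freqs - 1) for i in range(n_freqs)]
    ratios = [
        ar_power_response(ar_state_a, f) / ar_power_response(ar_state_b, f)
        for f in freqs
    ]
    return freqs, ratios


def resonator(freq, radius=0.9):
    """Hypothetical AR(2) coefficients with poles at radius * exp(+-j*2*pi*freq)."""
    return [2 * radius * math.cos(2 * math.pi * freq), -radius ** 2]


# Assumed example: state A resonates near 0.1, state B near 0.3 cycles/sample.
freqs, ratios = power_ratio_curve(resonator(0.1), resonator(0.3))
f_high = freqs[ratios.index(max(ratios))]  # candidate band where state A dominates
f_low = freqs[ratios.index(min(ratios))]   # candidate band where state B dominates
```

In this toy setting the ratio peaks close to the resonance of state A and dips close to the resonance of state B, which suggests one band around each frequency.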
We propose a novel incremental feature selection method based on minimum uncertainty and feature sample elimination (referred to as MUSE) for binary classification. The proposed approach differs from the prior mRMR approach in how the redundancy of the current feature with previously selected features is reduced. In the proposed approach, the feature samples are divided into a pre-specified number of bins; this step is referred to as feature quantization. A novel uncertainty score for each feature is computed by summing the conditional entropies of the bins, and the feature with the lowest uncertainty score is selected. For each bin, its impurity is computed by taking the minimum of the probability of Class 1 and of Class 2. The feature samples corresponding to the bins with impurities below a threshold are discarded and are not used for selection of the subsequent features. The significance of the MUSE feature selection method is demonstrated using two public datasets, arrhythmia and handwritten digit recognition (Gisette), and datasets for seizure prediction from five dogs and two humans. It is shown that the proposed method outperforms the prior mRMR feature selection method in most cases.

We further extend the MUSE algorithm to multi-class classification problems. We propose a novel multi-class feature selection algorithm based on weighted conditional entropy, also referred to as uncertainty. The goal of the proposed algorithm is to select a feature subset such that, for each feature sample, there exists a feature in the selected subset that has a low uncertainty score. Features are first quantized into different bins. The proposed feature selection method first computes an uncertainty vector from weighted conditional entropy. The lower the uncertainty score for a class, the better the separability of the samples in that class.
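The binary MUSE procedure described above (quantization, uncertainty score, impurity-based sample elimination) can be sketched roughly as follows. This is a minimal reading of the abstract, not the dissertation's implementation: equal-width binning, weighting each bin's entropy by its share of the remaining samples, the default threshold, and all names and data are assumptions.

```python
import math


def quantize(values, n_bins):
    """Equal-width binning (an assumption; the abstract does not fix the scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def score_feature(bins, labels, active, n_bins, threshold):
    """Uncertainty score over the active samples, plus the set of low-impurity bins."""
    counts, ones = [0] * n_bins, [0] * n_bins
    for i in active:
        counts[bins[i]] += 1
        ones[bins[i]] += labels[i]
    score, pure = 0.0, set()
    for b in range(n_bins):
        if counts[b] == 0:
            continue
        p1 = ones[b] / counts[b]
        # Assumed weighting: each bin's entropy counts by its share of samples.
        score += counts[b] / len(active) * binary_entropy(p1)
        if min(p1, 1 - p1) < threshold:  # impurity = min(P(class 1), P(class 2))
            pure.add(b)
    return score, pure


def muse_select(features, labels, n_select, n_bins=3, threshold=0.05):
    """Greedy selection; labels are 0/1, features maps name -> list of values."""
    binned = {name: quantize(col, n_bins) for name, col in features.items()}
    active = list(range(len(labels)))
    selected = []
    while len(selected) < n_select and active:
        best = None
        for name, bins in binned.items():
            if name in selected:
                continue
            score, pure = score_feature(bins, labels, active, n_bins, threshold)
            if best is None or score < best[0]:
                best = (score, name, pure)
        if best is None:
            break
        _, name, pure = best
        selected.append(name)
        # Discard samples that the chosen feature already separates well.
        active = [i for i in active if binned[name][i] not in pure]
    return selected


# Toy data: "c" duplicates "a"; "b" only helps on the samples "a" leaves mixed.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0.0, 0.2, 0.4, 1.4, 1.6, 2.6, 2.8, 3.0],
    "c": [0.0, 0.2, 0.4, 1.4, 1.6, 2.6, 2.8, 3.0],
    "b": [0.0, 3.0, 0.0, 0.0, 3.0, 3.0, 0.0, 3.0],
}
selected = muse_select(features, labels, n_select=2)
```

On this toy data the sketch picks `a` first and then `b` rather than the redundant copy `c`, because the samples that `a` already separates are no longer considered when the second feature is scored.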
Next, an iterative feature selection method selects a feature in each iteration by (1) computing the minimum uncertainty score for each feature sample for all possible feature subset candidates, (2) computing the average minimum uncertainty score across all feature samples, and (3) selecting the feature that achieves the minimum of the mean of the minimum uncertainty score. The experimental results on various types of publicly available datasets show that the proposed algorithm outperforms mRMR and achieves lower misclassification rates. In most cases, the number of features necessary for a specified misclassification error is smaller than that required by traditional methods.
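The three-step iteration above can be sketched as a greedy loop. This is only an illustration under simplifying assumptions: each sample's uncertainty under a feature is taken to be the plain entropy of the class mix in its bin, not the weighted conditional entropy used in the dissertation, features are assumed to be quantized already, and all names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def sample_uncertainty(bins, labels):
    """Entropy of the class mix in each sample's bin, one value per sample."""
    by_bin = defaultdict(Counter)
    for b, y in zip(bins, labels):
        by_bin[b][y] += 1
    entropy = {}
    for b, counts in by_bin.items():
        total = sum(counts.values())
        entropy[b] = -sum(
            c / total * math.log2(c / total) for c in counts.values()
        )
    return [entropy[b] for b in bins]


def greedy_min_uncertainty(binned_features, labels, n_select):
    """Pick features that minimize the mean of the per-sample minimum uncertainty."""
    scores = {
        name: sample_uncertainty(bins, labels)
        for name, bins in binned_features.items()
    }
    best_so_far = [math.inf] * len(labels)  # per-sample minimum over selected features
    selected = []
    for _ in range(min(n_select, len(scores))):

        def mean_min(name):
            # Steps (1) and (2): per-sample minimum, then the average.
            per_sample = [min(u, s) for u, s in zip(best_so_far, scores[name])]
            return sum(per_sample) / len(labels)

        # Step (3): the candidate with the smallest mean minimum uncertainty.
        candidate = min((n for n in scores if n not in selected), key=mean_min)
        selected.append(candidate)
        best_so_far = [
            min(u, s) for u, s in zip(best_so_far, scores[candidate])
        ]
    return selected


# Toy three-class data with pre-quantized features; "z" duplicates "x".
labels = [0, 0, 0, 1, 1, 2, 2]
binned_features = {
    "x": [0, 0, 0, 1, 1, 1, 1],  # isolates class 0
    "z": [0, 0, 0, 1, 1, 1, 1],
    "y": [0, 0, 0, 0, 0, 1, 1],  # isolates class 2
    "w": [0, 1, 0, 1, 0, 1, 0],  # uninformative
}
selected = greedy_min_uncertainty(binned_features, labels, n_select=2)
```

Here `x` is chosen first and `y` second: the duplicate `z` cannot lower any sample's minimum uncertainty once `x` is in the subset, whereas `y` lowers it for the samples of classes 1 and 2.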

Keywords/Search Tags:

Feature, Class, Uncertainty score, Datasets, Approach, Binary, MUSE

Related items

1	Research On Potential Home Broadband User Identification Problem With Large Scale Imbalanced Datasets
2	Study Of Efficient Feature Selection And Classification Methods For Gene Expression Microarray Datasets
3	Multi-Label Feature Selection Algorithms Based On Fisher Score
4	Classification Methods For Class-imbalanced Datasets Of Unequal Misclassification Costs And Their Applications
5	Cost-Sensitive Feature And Instance Selection For Imbalanced Network Abnormal Datasets
6	The Design And Implementation Of Score Management System Based On Data Mining Technology
7	The Research And Application Of Thresholding Algorithm Based On Class Uncertainty Theory
8	Research On Imbalanced Datasets Classification Based On Machine Learning And Oversampling Methods
9	The Research Of The Approach For Internet Commodities Comprehensive Score Based On User Comments
10	Privacy preservation for training datasets in database: Application to decision tree learning