
Sparse Learning in Multiclass Problems

Posted on: 2012-06-28    Degree: Ph.D    Type: Dissertation
University: North Carolina State University    Candidate: Li, Nan    Full Text: PDF
GTID: 1458390008495536    Subject: Statistics
Abstract/Summary:
Multi-category classification is an important topic in statistical learning and data mining, with many applications such as handwritten zip code digit recognition and cancer type classification from DNA microarray data. In many practical problems, multi-class classification is made even more challenging by the presence of a large number of candidate predictors. In biological and medical applications such as microarrays in particular, the number of variables far exceeds the training sample size, even though the underlying model may be sparse. In these applications it is essential to identify the important variables in order to achieve classifiers with higher prediction accuracy and better model interpretability.

However, variable selection in multi-class classification is much more complicated than in binary classification or regression problems. We need to estimate multiple discriminant functions, one for each class, and also decide which variables should be included in each function. In this dissertation, we address the multi-class variable selection problem by introducing a new penalty, the supSCAD penalty. Designed specifically for multiclass problems, this penalty groups the coefficients by their associated covariates and imposes a SCAD penalty on the sup-norm of each group, hence the name supSCAD. We apply the new variable selection method to both soft and hard classification through supSCAD multinomial logistic regression and the supSCAD multicategory support vector machine. We show that, with a proper choice of the tuning parameter, supSCAD multinomial logistic regression identifies the underlying sparse model consistently and enjoys the desired oracle properties. Using local linear and quadratic approximations to handle the non-concave SCAD penalty and the nonlinear multinomial log-likelihood, both procedures can be reformulated as a series of linear or quadratic programming problems. The performance of the procedures is illustrated by both simulations and real data applications.
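To make the construction concrete, here is a minimal sketch of the supSCAD penalty under assumed notation (K classes, p covariates, and coefficient beta_{kj} for covariate j in the discriminant function of class k); the dissertation's own notation may differ:

\[
\sum_{j=1}^{p} p_{\lambda}\!\left( \|\boldsymbol{\beta}_{(j)}\|_{\infty} \right),
\qquad
\|\boldsymbol{\beta}_{(j)}\|_{\infty} = \max_{1 \le k \le K} |\beta_{kj}|,
\]

where \(\boldsymbol{\beta}_{(j)} = (\beta_{1j}, \dots, \beta_{Kj})^{\top}\) collects the coefficients of covariate j across all classes and \(p_{\lambda}\) is the SCAD penalty, commonly specified through its derivative

\[
p_{\lambda}'(t) = \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_{+}}{(a - 1)\lambda}\, I(t > \lambda) \right\},
\qquad t \ge 0, \ a > 2 .
\]

Penalizing the sup-norm of each group means an unimportant covariate can be dropped from all K discriminant functions simultaneously, which yields covariate-level sparsity in the multiclass model. The local linear approximation step mentioned above can likewise be sketched, under the same assumed notation, as replacing \(p_{\lambda}(\|\boldsymbol{\beta}_{(j)}\|_{\infty})\) at a current iterate \(\tilde{\boldsymbol{\beta}}\) by \(p_{\lambda}'(\|\tilde{\boldsymbol{\beta}}_{(j)}\|_{\infty})\,\|\boldsymbol{\beta}_{(j)}\|_{\infty}\) plus a constant, which is piecewise linear in the coefficients; combined with a linear or quadratic approximation of the loss, each iteration then reduces to a linear or quadratic programming problem.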
Keywords/Search Tags:Applications, Sparse, Classification