
Data complexity in machine learning and novel classification algorithms

Posted on: 2007-05-22
Degree: Ph.D
Type: Dissertation
University: California Institute of Technology
Candidate: Li, Ling
Full Text: PDF
GTID: 1458390005482598
Subject: Computer Science
Abstract/Summary:
We investigate the role of data complexity in the context of binary classification problems. The universal data complexity of a data set is defined as the Kolmogorov complexity of the mapping enforced by that data set; it is closely related to several existing principles in machine learning, such as Occam's razor. We demonstrate its application in data decomposition and data pruning. In data decomposition, we show that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to complexity and set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating these contributions.

We propose a family of novel learning algorithms that directly minimize the 0/1 loss for perceptrons. Our algorithms usually achieve the lowest training error among the methods compared, while remaining computationally efficient. These advantages make them attractive both for standalone use and for ensemble learning; experiments show that they work very well with AdaBoost.

AdaBoost is an ensemble method that aggregates many base hypotheses for binary classification problems. It can be viewed as gradient descent on a margin cost function. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are harder to optimize.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. A coding matrix with strong error-correcting ability may not be optimal overall if the resulting binary problems are too hard for the base learner. We propose a new multiclass boosting algorithm that adapts the coding matrix to the learning ability of the base learner. Experiments show that our algorithm is very efficient at optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one.
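Kolmogorov complexity is uncomputable, so any concrete estimate of a complexity contribution must rely on a proxy. The sketch below illustrates only the leave-one-out idea behind data pruning, not the thesis's estimator: it assumes scikit-learn, uses the number of leaves a decision tree needs to fit the data as a stand-in complexity measure, and scores each example by how much that proxy drops when the example is removed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def proxy_complexity(X, y):
    """Proxy for data-set complexity: leaves of a tree grown to fit (X, y).
    This stands in for the (uncomputable) Kolmogorov complexity."""
    return DecisionTreeClassifier().fit(X, y).get_n_leaves()

def complexity_contributions(X, y):
    """Leave-one-out estimate: how much the complexity proxy drops
    when each example is removed. Outliers should score high."""
    base = proxy_complexity(X, y)
    idx = np.arange(len(y))
    return np.array([base - proxy_complexity(X[idx != i], y[idx != i])
                     for i in idx])
```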
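For the 0/1-loss perceptron algorithms, one fact makes a direct search feasible: along any direction d, the training error of w + a*d is piecewise constant in a, changing only at steps where some example crosses the hyperplane, so an exact line search needs to test just one step per breakpoint. The sketch below exploits this with random directions; it illustrates the idea but is not the thesis's exact algorithm, and it assumes NumPy and labels in {-1, +1}.

```python
import numpy as np

def zero_one_loss(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))

def random_direction_descent(X, y, epochs=200, seed=0):
    """Directly minimize 0/1 loss: each epoch picks a random direction,
    enumerates the steps where an example crosses the hyperplane, and
    takes the step with the lowest training error (exact line search)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    best = zero_one_loss(w, X, y)
    for _ in range(epochs):
        d = rng.standard_normal(X.shape[1])
        with np.errstate(divide="ignore", invalid="ignore"):
            breaks = -(X @ w) / (X @ d)   # where (w + a*d) . x_i = 0
        breaks = breaks[np.isfinite(breaks)]
        cands = np.concatenate(([0.0], breaks - 1e-9, breaks + 1e-9))
        errs = [zero_one_loss(w + a * d, X, y) for a in cands]
        i = int(np.argmin(errs))
        if errs[i] < best:
            best, w = errs[i], w + cands[i] * d
    return w, best
```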
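The "gradient descent on a margin cost function" view of AdaBoost can be made concrete: the negative functional gradient of the exponential cost C(F) = mean(exp(-y F(x))) reweights the training examples, a base learner is fit to approximate that direction, and an exact line search yields AdaBoost's usual step size. The sketch below shows this plain gradient-descent view, with scikit-learn decision stumps as an assumed base learner; CGBoost's conjugate-gradient correction to the search direction is not shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_exponential_cost(X, y, rounds=50):
    """Functional gradient descent on mean(exp(-y * F(x))), y in {-1,+1}.
    With exact line search this recovers AdaBoost's alpha."""
    F = np.zeros(len(y))
    ensemble = []
    for _ in range(rounds):
        w = np.exp(-y * F)
        w /= w.sum()                       # weights = -gradient, up to scale
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()           # weighted error of this direction
        if err == 0 or err >= 0.5:         # no useful descent step left
            break
        alpha = 0.5 * np.log((1 - err) / err)  # exact line search step
        F += alpha * pred
        ensemble.append((alpha, stump))
    return ensemble
```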
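The coding-matrix reduction itself is the standard error-correcting output code (ECOC) scheme: column b of a K x B matrix with entries in {-1, 0, +1} relabels each class (0 means the class is left out of that binary problem), one binary learner is trained per column, and decoding picks the class whose codeword agrees best with the B predictions. A minimal sketch assuming scikit-learn and integer class labels 0..K-1; the thesis's contribution, adapting the matrix to the base learner, is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_train(X, y, M):
    """One binary learner per column of the coding matrix M (K x B)."""
    learners = []
    for b in range(M.shape[1]):
        relabel = M[y, b]                  # binary label of each example
        keep = relabel != 0                # drop classes left out of column b
        learners.append(LogisticRegression().fit(X[keep], relabel[keep]))
    return learners

def ecoc_predict(X, M, learners):
    """Decode: the class whose codeword best agrees with the predictions."""
    preds = np.column_stack([c.predict(X) for c in learners])  # n x B, +/-1
    return (preds @ M.T).argmax(axis=1)
```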
Keywords/Search Tags: Data, Classification, Algorithms, Cost, Multiclass, Binary, AdaBoost