
Data complexity in machine learning and novel classification algorithms

Posted on: 2007-05-22
Degree: Ph.D
Type: Dissertation
University: California Institute of Technology
Candidate: Li, Ling
Full Text: PDF
GTID: 1458390005482598
Subject: Computer Science
Abstract/Summary:
We investigate the role of data complexity in the context of binary classification problems. The universal data complexity of a data set is defined as the Kolmogorov complexity of the mapping enforced by that data set; it is closely related to several existing principles in machine learning, such as Occam's razor. We demonstrate its application in data decomposition and data pruning. In data decomposition, we show that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to complexity and set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating these contributions.

We propose a family of novel learning algorithms that directly minimize the 0/1 loss for perceptrons. Our algorithms usually achieve the lowest training error among the methods compared, while remaining computationally efficient. These advantages make them attractive both for standalone use and for ensemble learning; experiments show that they work very well with AdaBoost.

AdaBoost is an ensemble method that aggregates many base hypotheses for binary classification problems. It can be viewed as gradient descent on a margin cost function. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are harder to optimize.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. A coding matrix with strong error-correcting ability may not be optimal overall if the resulting binary problems are too hard for the base learner. We propose a new multiclass boosting algorithm that adapts the coding matrix to the learning ability of the base learner. Experiments show that our algorithm is very efficient at optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one.
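Kolmogorov complexity is uncomputable, so any concrete estimate of a complexity contribution must rely on a proxy. The sketch below illustrates only the leave-one-out idea behind data pruning, not the thesis's estimator: it assumes scikit-learn, uses the number of leaves a decision tree needs to fit the data as a stand-in complexity measure, and scores each example by how much that proxy drops when the example is removed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def proxy_complexity(X, y):
    """Proxy for data-set complexity: leaves of a tree grown to fit (X, y).
    This stands in for the (uncomputable) Kolmogorov complexity."""
    return DecisionTreeClassifier().fit(X, y).get_n_leaves()

def complexity_contributions(X, y):
    """Leave-one-out estimate: how much the complexity proxy drops
    when each example is removed. Outliers should score high."""
    base = proxy_complexity(X, y)
    idx = np.arange(len(y))
    return np.array([base - proxy_complexity(X[idx != i], y[idx != i])
                     for i in idx])
```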
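For the 0/1-loss perceptron algorithms, one fact makes a direct search feasible: along any direction d, the training error of w + a*d is piecewise constant in a, changing only at steps where some example crosses the hyperplane, so an exact line search needs to test just one step per breakpoint. The sketch below exploits this with random directions; it illustrates the idea but is not the thesis's exact algorithm, and it assumes NumPy and labels in {-1, +1}.

```python
import numpy as np

def zero_one_loss(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))

def random_direction_descent(X, y, epochs=200, seed=0):
    """Directly minimize 0/1 loss: each epoch picks a random direction,
    enumerates the steps where an example crosses the hyperplane, and
    takes the step with the lowest training error (exact line search)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    best = zero_one_loss(w, X, y)
    for _ in range(epochs):
        d = rng.standard_normal(X.shape[1])
        with np.errstate(divide="ignore", invalid="ignore"):
            breaks = -(X @ w) / (X @ d)   # where (w + a*d) . x_i = 0
        breaks = breaks[np.isfinite(breaks)]
        cands = np.concatenate(([0.0], breaks - 1e-9, breaks + 1e-9))
        errs = [zero_one_loss(w + a * d, X, y) for a in cands]
        i = int(np.argmin(errs))
        if errs[i] < best:
            best, w = errs[i], w + cands[i] * d
    return w, best
```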
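The "gradient descent on a margin cost function" view of AdaBoost can be made concrete: the negative functional gradient of the exponential cost C(F) = mean(exp(-y F(x))) reweights the training examples, a base learner is fit to approximate that direction, and an exact line search yields AdaBoost's usual step size. The sketch below shows this plain gradient-descent view, with scikit-learn decision stumps as an assumed base learner; CGBoost's conjugate-gradient correction to the search direction is not shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_exponential_cost(X, y, rounds=50):
    """Functional gradient descent on mean(exp(-y * F(x))), y in {-1,+1}.
    With exact line search this recovers AdaBoost's alpha."""
    F = np.zeros(len(y))
    ensemble = []
    for _ in range(rounds):
        w = np.exp(-y * F)
        w /= w.sum()                       # weights = -gradient, up to scale
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()           # weighted error of this direction
        if err == 0 or err >= 0.5:         # no useful descent step left
            break
        alpha = 0.5 * np.log((1 - err) / err)  # exact line search step
        F += alpha * pred
        ensemble.append((alpha, stump))
    return ensemble
```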
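The coding-matrix reduction itself is the standard error-correcting output code (ECOC) scheme: column b of a K x B matrix with entries in {-1, 0, +1} relabels each class (0 means the class is left out of that binary problem), one binary learner is trained per column, and decoding picks the class whose codeword agrees best with the B predictions. A minimal sketch assuming scikit-learn and integer class labels 0..K-1; the thesis's contribution, adapting the matrix to the base learner, is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_train(X, y, M):
    """One binary learner per column of the coding matrix M (K x B)."""
    learners = []
    for b in range(M.shape[1]):
        relabel = M[y, b]                  # binary label of each example
        keep = relabel != 0                # drop classes left out of column b
        learners.append(LogisticRegression().fit(X[keep], relabel[keep]))
    return learners

def ecoc_predict(X, M, learners):
    """Decode: the class whose codeword best agrees with the predictions."""
    preds = np.column_stack([c.predict(X) for c in learners])  # n x B, +/-1
    return (preds @ M.T).argmax(axis=1)
```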
Keywords/Search Tags: Data, Classification, Algorithms, Cost, Multiclass, Binary, AdaBoost