Data mining via mathematical programming and machine learning

Posted on:2001-10-06

Degree:Ph.D

Type:Dissertation

University:The University of Wisconsin - Madison

Candidate:Musicant, David R

Full Text:PDF

GTID:1468390014459142

Subject:Computer Science

Abstract/Summary:

This work explores solving large-scale data mining problems through the use of mathematical programming methods. In particular, algorithms are proposed for the support vector machine (SVM) classification problem, which consists of constructing a separating surface that can discriminate between points from one of two classes. An algorithm based on successive overrelaxation (SOR) is presented which can process very large datasets that need not reside in memory. Concepts from generalized SVMs are combined with SOR and with linear programming to find nonlinear separating surfaces. An “active set” strategy is used to generate a fast algorithm that consists of solving a finite number of linear equations of the order of the dimensionality of the original input space at each step. This ASVM active set algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. An implicit Lagrangian for the dual of an SVM is used to lead to the simple linearly convergent Lagrangian SVM (LSVM) algorithm. LSVM requires the inversion at the outset of a single (typically small) matrix, and the full algorithm is given in 11 lines of MATLAB code.; Support vector regression problems are considered as well. The problem of tolerant data fitting by a nonlinear surface is formulated as a linear program with fewer variables than that of other linear programming formulations. A generalization of the linear programming chunking algorithm for arbitrary kernels is implemented wherein chunking is performed on both data points and problem variables. The robust Huber M-estimator, a differentiable cost function that is quadratic for small errors and linear otherwise, is modeled exactly in the original primal space of the problem by an easily solvable convex quadratic program for both linear and nonlinear support vector estimators. Experiments show that the above classification and regression techniques show strong performance in accuracy, speed, and scalability on both real-world datasets and synthetic ones. In some cases, datasets on the order of millions of points were utilized. These results indicate that SVMs, typically used on smaller datasets, can be used to solve massive data mining problems.

Keywords/Search Tags:

Data mining, Programming, Problem, SVM, Algorithm, Linear, Used

Related items

1	Research And Development On One Dimensional Optimum Cutting System Base Linear Programming
2	Algorithm Design Based On Linear Programming
3	Optimization methods in data mining and machine learning
4	Research And Design Of Society Economy Dynamic Simulation System Based On Linear Programming
5	The New Privacy Preserving Algorithms For Linear Programming And SVMS
6	The Improvement And Application Of Approximate Linear Programming Algorithm
7	Research On Rectangle Two-Dimensional Guillotine Cutting Stock Problem
8	Classification Algorithm Based On Multi Expression Programming
9	Research On DNA Algorithm Of 0-1 Programming Problem
10	A Recursive Algorithm Of One-Dimensional Cutting Stock Problem Based On Multi-Thread Technique