Large margin classification methods for data mining

Posted on:2000-04-22

Degree:Ph.D

Type:Dissertation

University:Rensselaer Polytechnic Institute

Candidate:Wu, Donghui

GTID:1468390014966610

Subject:Mathematics

Abstract/Summary:

In this dissertation, we study large margin classification methods for data mining. First, we examine a support vector machine (SVM) approach for classifying spam e-mail. We address issues such as preprocessing of messages, data representation, and training and evaluating SVMs. Our SVM-based e-mail filters are very accurate, effective, and robust. Second, a new extension of SVM, the logical support vector machine, is proposed. Logical SVMs construct learning machines in a special logical feature space and can handle categorical and numerical data uniformly. Logical SVMs are simple to train and may improve classification accuracy on noisy data. Third, we examine algorithms for large margin perceptron decision trees (PDT) for induction. In particular, we demonstrate how incorporating margin maximization into a PDT algorithm can dramatically improve classification performance. An experimental study confirms the theoretical result that enlarging margins in PDTs improves generalization. Next, we examine how these inductive PDT algorithms can be extended to transduction. Transduction uses both labeled training data and unlabeled test data for learning, and infers the labels of the unlabeled test data directly without first inducing a classifier. Our transduction algorithm is significantly better than its non-margin-based induction counterpart, but shows only a slight advantage over its large margin induction counterpart. Finally, we propose the support vector decision tree (SVDT) method, a linear SVM-based large margin PDT algorithm. The SVDT approach is then applied to three large database marketing applications. In all three cases, simple decision trees are constructed that use only a very small fraction of the features. An effective method for producing gains charts based on SVDT is developed as well. The SVDT algorithm is well suited to large data mining applications.
Overall, our study shows that enlarging margins in classification is a very effective strategy for inductive learning. This result holds across the many types of methodologies investigated. The advantage of transduction is less clear, so further study is required.

Keywords/Search Tags:

Large margin, Data, Classification, SVM, Support vector, PDT, SVDT, Transduction

Related items

1	Fast Computational Algorithms And Theoretical Analysis In Large-Margin Classification Models
2	Multispectral Data Classification Based On Support Vector Machines
3	The Study Of Several Key Issues On Large Data Sets Classification Techniques In Pattern Recognition
4	Two Kinds Of Improved Fuzzy Support Vector Machines
5	Research On Structure Support Vector Machine Classification Models
6	Nonparallel Hyperplanes Support Vector Machines
7	Research On Maximum Margin Classification Theory And Its Application
8	Leaf Margin Classification Based On Local Images Feature
9	Research On Large Margin Classifier Based On Optimizing Margin Distribution
10	On Topics Of Multi-Category Classification With Large Margin Based Methods