Large margin classification methods for data mining

Posted on: 2000-04-22
Degree: Ph.D
Type: Dissertation
University: Rensselaer Polytechnic Institute
Candidate: Wu, Donghui
Full Text: PDF
GTID: 1468390014966610
Subject: Mathematics
Abstract/Summary:
In this dissertation, we study large margin classification methods for data mining. First, we examine a support vector machine (SVM) approach for classifying spam e-mail. We address issues such as preprocessing of messages, data representation, and training and evaluating SVMs. Our SVM-based e-mail filters are very accurate, effective and robust. Second, a new extension of SVM, the logical support vector machine, is proposed. Logical SVMs construct learning machines in a special logical feature space, and can handle categorical and numerical data uniformly. Logical SVMs are simple to train and may improve the classification accuracy of noisy data. Third, we examine algorithms for large margin perceptron decision trees (PDT) for induction. In particular, we demonstrate how incorporating margin maximization into a PDT algorithm can dramatically improve classification performance. An experimental study confirms the theoretical result that enlarging margins in PDTs improves generalization. Next, we examine how these inductive PDT algorithms can be extended to transduction. Transduction uses both labeled training data and unlabeled test data for learning, and infers the labels of unlabeled test data directly without inducing a classifier first. Our transduction algorithm is significantly better than its non-margin-based induction counterpart, but shows only a slight advantage over its large margin induction counterpart. Finally, we propose the support vector decision tree method (SVDT), a linear SVM-based large margin PDT algorithm. The SVDT approach is then applied to three large database marketing applications. In all three cases, simple decision trees are constructed that use only a very small fraction of the features. An effective method for producing gains charts based on SVDT is developed as well. The SVDT algorithm is very suitable for solving large data mining applications.
Overall, our study shows that enlarging margins in classification is a very effective strategy for inductive learning. This result holds across the many types of methodologies investigated. The advantage of transduction is less clear, so further study is required.
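To make the notion of a margin concrete, the following is a minimal illustrative sketch, not code from the dissertation. It assumes a linear decision rule sign(w·x + b) and computes the geometric margin, the smallest signed distance from any training point to the separating hyperplane. The function name `geometric_margin` and the toy data are hypothetical and chosen only for illustration.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point is on the correct side; a larger
    value means the hyperplane separates the classes with a wider margin.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm
        for point, label in zip(X, y)
    )

# Hypothetical toy data: two classes separated along the first coordinate.
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
y = [-1, -1, 1, 1]

# Both hyperplanes classify every point correctly, but they differ in margin.
centered = geometric_margin((1.0, 0.0), -1.5, X, y)  # midway between classes
skewed = geometric_margin((1.0, 0.0), -0.5, X, y)    # close to the negative class
```

Both hyperplanes separate the toy data, but the centered one leaves more room on each side. Large margin methods prefer separators of that kind.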
Keywords/Search Tags: Large margin, Data, Classification, SVM, Support vector, PDT, SVDT, Transduction