Learning in extreme conditions: Online and active learning with massive, imbalanced and noisy data

Posted on:2010-01-22

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:Ertekin, Seyda

GTID:2448390002973790

Subject:Computer Science

Abstract/Summary:

This thesis addresses improving the performance of machine learning algorithms, with a particular focus on classification tasks with large, imbalanced and noisy datasets. The field of machine learning addresses the question of how best to use experimental or historical data to discover general patterns and regularities and to improve decision making. However, applying machine learning algorithms to very large scale problems still poses challenges. Additionally, class imbalance and noise in the data degrade the prediction accuracy of standard machine learning algorithms. The main focus of this thesis is designing machine learning algorithms and approaches that are faster, more data efficient and less demanding of computational resources, so that they scale to large problems. This thesis addresses these problems in active and online learning frameworks. The particular focus is on the Support Vector Machine (SVM) algorithm for classification problems, but the proposed active and online learning approaches also extend readily to other widely used machine learning algorithms.

This thesis first proposes a fast online Support Vector Machine algorithm (LASVM) that offers a substantial speed improvement over the classical (batch) SVM and other online SVM algorithms, while preserving the classification accuracy of state-of-the-art SVM solvers. The ability to handle streaming data, the speed improvement in both training and recognition, and the lower memory demand of the online learning setting make SVMs applicable to very large datasets. The effectiveness of LASVM and active learning in real world problems is assessed by targeting the name disambiguation problem in CiteSeer's repository of scientific publications. The thesis then presents an efficient active learning framework to target the expensive labeling problem for producing training data. The proposed method yields an efficient querying system and removes the barrier that high computational costs pose to applying active learning to very large scale datasets. We then point out that even when the labels are readily available, active learning can still be used to reach the most informative instances in the training data. By applying active sample selection and early stopping in the online SVM, we show that the algorithm can reach and even exceed the prediction accuracy of the baseline LASVM setting with random sampling. Our experimental results also reveal that active learning can be a highly effective method for dealing with the class imbalance problem. We further propose a hybrid method of oversampling and active learning to form an adaptive technique (named VIRTUAL) that efficiently resamples the minority class instances in imbalanced data classification.

Finally, we propose a non-convex online SVM algorithm (LASVM-NC) based on the Ramp loss, which strongly suppresses the influence of outliers in noisy datasets. Then, again in the online learning setting, we propose an outlier filtering mechanism that approximates non-convex behavior in convex optimization (LASVM-I). These two algorithms are built on an online SVM solver (LASVM-G) which leverages the duality gap to obtain more trustworthy intermediate models. Our experimental results show that the proposed approaches yield a more scalable online SVM algorithm with sparser models and shorter running time in both the training and recognition phases, without sacrificing generalization performance. In the end, we also point out the relation between the non-convex behavior in SVMs and active learning.
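
To illustrate why a Ramp-type loss limits the influence of outliers, here is a minimal Python sketch. It assumes the common formulation of the Ramp loss as the difference of two hinge losses, R_s(z) = H_1(z) - H_s(z) with H_a(z) = max(0, a - z), where z is the signed margin y * f(x). The parameter value s = -1 and the function names are illustrative assumptions and are not taken from the thesis.

```python
def hinge_loss(margin, offset=1.0):
    """Hinge loss H_a(z) = max(0, a - z); grows without bound as z decreases."""
    return max(0.0, offset - margin)


def ramp_loss(margin, s=-1.0):
    """Ramp loss R_s(z) = H_1(z) - H_s(z), assuming s < 1.

    The loss equals the hinge loss for margins above s and is
    capped at 1 - s for margins at or below s.
    """
    return hinge_loss(margin, 1.0) - hinge_loss(margin, s)


if __name__ == "__main__":
    # Compare the two losses on a well-classified point, a point on the
    # decision boundary, and a badly misclassified (outlier-like) point.
    for z in (2.0, 0.0, -5.0):
        print(z, hinge_loss(z), ramp_loss(z))
```

Under these assumptions, a point with margin -5 incurs a hinge loss of 6 but a ramp loss of only 2, so a single badly misclassified instance cannot dominate the training objective.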

Keywords/Search Tags:

Active learning, SVM, Machine learning algorithms, Online, Data, Imbalanced, Thesis, Large

Related items

1	Research On Online Learning Algorithms For Drifting Imbalanced Data Stream
2	Study Of Active Learning Algorithms On Imbalanced Data Using Extreme Learning Machine
3	Research On Binary Imbalanced Large Data Classification And Its Application
4	Research On Online Active Learning For Class-imbalanced Data Stream
5	Algorithms And Applications Of Imbalanced Data Classification Based On Semisupervised Learning
6	The Classification Of Imbalanced Large Data Sets Based On Map Reduce
7	Research On Direct Optimization Of AUC Algorithm For Large-scale Data
8	Research On Extreme Learning Machine For Online Sequential Imbalanced Data Classification
9	Research And Application Of Large-scale Machine Learning Theory
10	Methods for large-scale machine learning and computer vision