
Online Feature Selection For Sparse Data

Posted on: 2017-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Tan
Full Text: PDF
GTID: 2348330503468498
Subject: Software engineering
Abstract/Summary:
In today's era of big data, the amount of data generated by society is exploding. Big data streams have several characteristics: high volume, high velocity, high dimensionality, high sparsity, and high class imbalance. In this setting, batch learning methods fail because of their high cost and low efficiency, so this thesis focuses on the online learning approach.

Typical online learning algorithms keep at least one weight for every feature, which is too expensive in applications with space constraints and test-time constraints. Online feature selection aims to select a subset of relevant features for building effective prediction models. By removing irrelevant and redundant features, feature selection can improve the performance of prediction models: it alleviates the curse of dimensionality, enhances generalization, speeds up learning, and improves model interpretability.

To investigate the online feature selection problem, this thesis proposes a Passive-Aggressive (PA) Truncated Gradient method, which learns the prediction model in a passive or aggressive way and then selects a feature subset using the truncated gradient method and truncation techniques.

To address the multitask learning problem, this thesis proposes a multitask collaborative feature selection method. The basic idea is to first build a collaborative model that leverages both the global model and the single-task model, and then to select a fixed number of active features from the collaborative model.

To tackle the challenge of imbalanced datasets, this thesis proposes two methods: (1) an imbalanced-margin PA truncated gradient method, which imposes different margins on majority and minority instances; and (2) a PA over-sampling method, which synthesizes minority instances.
Keywords/Search Tags: Online Learning, Feature Selection, Multitask, Imbalanced Dataset
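
The abstract does not give the update rules, so the following is only a minimal sketch of how a passive-aggressive step might be combined with truncated-gradient shrinkage and a fixed feature budget on sparse data. It assumes the standard PA-I update and the standard truncated-gradient operator; the function names (pa_update, truncated_gradient, keep_top_k, train) and the parameters C, gravity, theta, k, and margins are hypothetical and are not taken from the thesis. The per-class margins argument is one possible way to express the imbalanced-margin idea, again as an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

SparseVec = Dict[int, float]  # feature index -> value; absent means zero


def dot(w: SparseVec, x: SparseVec) -> float:
    """Inner product that touches only the non-zero entries of x."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())


def pa_update(w: SparseVec, x: SparseVec, y: int,
              C: float = 1.0, margin: float = 1.0) -> None:
    """Standard PA-I step (assumed, not the thesis's exact rule).

    Passive: leave w alone when the margin is already satisfied.
    Aggressive: otherwise take the smallest step that fixes it, capped by C.
    """
    loss = max(0.0, margin - y * dot(w, x))
    sq_norm = sum(v * v for v in x.values())
    if loss == 0.0 or sq_norm == 0.0:
        return
    tau = min(C, loss / sq_norm)
    for i, v in x.items():
        w[i] = w.get(i, 0.0) + tau * y * v


def truncated_gradient(w: SparseVec, gravity: float, theta: float) -> None:
    """Shrink weights with |w_i| <= theta toward zero by `gravity`.

    Weights that reach zero are dropped so the model stays sparse.
    """
    for i in list(w):
        wi = w[i]
        if abs(wi) <= theta:
            shrunk = max(0.0, wi - gravity) if wi > 0 else min(0.0, wi + gravity)
            if shrunk == 0.0:
                del w[i]
            else:
                w[i] = shrunk


def keep_top_k(w: SparseVec, k: int) -> None:
    """Hard truncation: keep only the k largest-magnitude weights."""
    if len(w) > k:
        keep = set(sorted(w, key=lambda i: abs(w[i]), reverse=True)[:k])
        for i in list(w):
            if i not in keep:
                del w[i]


def train(stream: Iterable[Tuple[SparseVec, int]], k: int, C: float = 1.0,
          gravity: float = 0.01, theta: float = 1.0,
          margins: Optional[Dict[int, float]] = None) -> SparseVec:
    """One pass over (x, y) pairs with y in {+1, -1}.

    `margins` maps each label to its own margin (hypothetical knob for
    treating majority and minority classes differently).
    """
    margins = margins or {1: 1.0, -1: 1.0}
    w: SparseVec = {}
    for x, y in stream:
        pa_update(w, x, y, C, margins[y])
        truncated_gradient(w, gravity, theta)
        keep_top_k(w, k)
    return w
```

For example, calling train(stream, k=100, margins={1: 2.0, -1: 1.0}) would ask for a wider margin on the positive class, under the assumption that it is the minority class. Storing weights in a dictionary means that memory grows with the number of selected features rather than with the full dimensionality, which matches the space constraint that motivates online feature selection.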