Class Imbalance Oriented Logistic Regression

Posted on: 2016-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Dong
Full Text: PDF
GTID: 2308330461450946
Subject: Computer software and theory
Abstract/Summary:
Class-imbalanced data sets are common in real-world applications. Most state-of-the-art classification algorithms assume that the training data are roughly balanced, so traditional algorithms perform poorly on data sets with an imbalanced class distribution. In imbalanced classification, instances of the minority class are often more important than instances of the majority class. Accuracy alone is therefore not an adequate measure of an imbalanced classification algorithm's performance; recall, g-mean, and f-measure are more suitable.

Logistic regression is a commonly used classification algorithm in data mining and machine learning, especially for binary classification. Its most important advantage is that it is probability-based and can easily be extended to multiclass problems. However, logistic regression is not well suited to imbalanced classification, because its objective function maximizes the sum of the log-probabilities of each instance being classified correctly, without considering whether the instance belongs to the minority or the majority class. As a result, more minority-class instances may be misclassified as majority class.

Building on the traditional logistic regression algorithm, this paper proposes three objective functions that take the class distribution of imbalanced data into account by combining the logistic regression objective with three evaluation metrics suited to imbalanced classification: recall, g-mean, and f-measure. We call the three metrics LRM (Logistic and Recall-based Metric), GBM (g-mean based Metric), and FBM (f-measure based Metric).
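To make the point about accuracy concrete, the following is a minimal sketch of how the three metrics can be computed from confusion-matrix counts. The abstract does not give the definitions it uses, so this sketch assumes the common ones: g-mean as the geometric mean of recall and specificity, and f-measure as the balanced F1 score. The function names `confusion_counts` and `imbalance_metrics` are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN, treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, g-mean and f-measure (assumed definitions)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: g-mean = sqrt(recall * specificity).
    g_mean = math.sqrt(recall * specificity)
    # Assumption: f-measure is F1 (precision and recall weighted equally).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Illustrative data: 95 majority-class (0) and 5 minority-class (1) instances,
# scored against a classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
metrics = imbalance_metrics(y_true, y_pred)
```

On this made-up example the always-majority classifier scores high on accuracy while recall, g-mean, and f-measure are all zero, which is the failure that accuracy hides.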
Based on these objective functions, we propose three classification algorithms suited to class-imbalanced problems: RBLR (Recall based Logistic Regression), GBLR (g-mean based Logistic Regression), and FBLR (f-measure based Logistic Regression). In the training stage, a quasi-Newton method is used to solve the optimization problem; the prediction stage follows an approach similar to that of traditional logistic regression.

Experimental results on 16 data sets show that RBLR, GBLR, and FBLR improve recall, g-mean, and f-measure compared with traditional logistic regression, and show a significant advantage over OSLR (Over-Sampled Logistic Regression) and USLR (Under-Sample Logistic Regression).
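For context on the OSLR and USLR baselines, the sketch below shows one simple way a training set can be rebalanced before fitting an ordinary logistic regression. The abstract does not say which sampling scheme was used, so this is an assumption: random duplication of minority instances for over-sampling, and random removal of majority instances for under-sampling. The function names are hypothetical, and both functions assume each class has at least one instance.

```python
import random


def _split_indices(y, minority):
    """Return (minority_indices, majority_indices) for the label list y."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    return minority_idx, majority_idx


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate random minority instances until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(y, minority)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = majority_idx + minority_idx + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority=1, seed=0):
    """Keep a random subset of majority instances the size of the minority class."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(y, minority)
    n_keep = min(len(minority_idx), len(majority_idx))
    kept = rng.sample(majority_idx, n_keep)
    idx = minority_idx + kept
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


# Illustrative data: 16 majority-class (0) and 4 minority-class (1) instances.
X = [[float(i)] for i in range(20)]
y = [0] * 16 + [1] * 4

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Over-sampling keeps every original instance but repeats minority ones, while under-sampling discards majority instances. Either resampled set would then be passed to a standard logistic regression fit.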
Keywords/Search Tags:classification, recall, g-mean, f-measure, imbalanced data sets