
An Imbalanced Data Classification Method Based On Active Learning

Posted on: 2016-07-21, Degree: Master, Type: Thesis
Country: China, Candidate: Y S Yao, Full Text: PDF
GTID: 2308330461956521, Subject: Computer application technology
Abstract/Summary:
Classification is an important topic in data mining and machine learning. As application areas expand, more kinds of data appear in research, and imbalanced data is one of them. Standard classification methods emphasize overall accuracy, so on imbalanced data they tend to sacrifice the minority class, even though the minority class is the more important one. Research has shown that active learning with SVM works well on imbalanced data, but at a high cost. It is therefore worthwhile to study how to reduce that cost while improving the classifier.

In this thesis, SID-SVM, a classification method based on active learning, is proposed for classifying imbalanced data sets. SID-SVM improves the classifier effectively at little iteration cost, lowers the data imbalance ratio, and is robust to both the imbalance ratio and the scale of the data set. It maintains high performance while cutting cost. The main contributions are as follows.

The SID method is presented for choosing the initial training set: choose the samples that are nearest to the other class. The SID method reduces the iteration cost and the data imbalance ratio. The method is tested on linearly separable data sets and extended to linearly non-separable data sets through SVM kernel functions.

The DC method is proposed for choosing the most informative samples during iteration: (1) randomly choose one sample from those misclassified by the current class boundary, if any exist; (2) choose the sample nearest to the current boundary. The chosen samples are then added to the training set. The DC method adjusts the class boundary sharply, and it works better on imbalanced data sets.

Two layers of optimization are applied. First, a Random Algorithm cuts the computation of SID while preserving the performance of the classifier. Second, the computation is run in parallel on Hadoop. These optimizations further reduce the pre-processing cost and make the method more practical.
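The SID and DC selection steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `sid_initial_set` and `dc_select`, the parameter `k`, the use of minimum Euclidean distance as "nearest to the other class", and the use of the absolute decision score as "nearest to the boundary" are all assumptions made for illustration. The classifier is passed in as a `decision` callable, so an SVM decision function could be plugged in.

```python
import math
import random


def sid_initial_set(majority, minority, k):
    """Pick, from each class, the k samples closest to the other class.

    Assumption: "distance to the other class" is the minimum Euclidean
    distance to any sample of that class; the abstract does not define it.
    """
    def closest(own, other):
        return sorted(own, key=lambda p: min(math.dist(p, q) for q in other))[:k]

    return closest(majority, minority), closest(minority, majority)


def dc_select(pool, decision, rng=random):
    """Choose up to two pool samples to add to the training set.

    pool     -- list of (point, label) pairs, label in {-1, +1}
    decision -- signed score of the current classifier; its sign is the
                predicted class (hypothetical interface for this sketch)
    """
    chosen = []
    # Step 1: one random sample misclassified by the current boundary, if any.
    wrong = [s for s in pool if decision(s[0]) * s[1] <= 0]
    if wrong:
        chosen.append(rng.choice(wrong))
    # Step 2: the remaining sample with the smallest absolute score, used
    # here as a stand-in for "nearest to the current boundary".
    rest = [s for s in pool if s not in chosen]
    if rest:
        chosen.append(min(rest, key=lambda s: abs(decision(s[0]))))
    return chosen


# Toy usage with a fixed linear boundary x + y = 5 (positive side = minority).
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
minority = [(3.0, 3.0), (4.0, 4.0)]
initial_majority, initial_minority = sid_initial_set(majority, minority, k=1)

pool = [((0.0, 0.0), -1), ((2.0, 2.0), -1), ((4.0, 4.0), 1), ((2.0, 2.5), 1)]
picked = dc_select(pool, lambda p: p[0] + p[1] - 5.0)
```

In a full active learning loop, the classifier would be retrained after each call to `dc_select` and the chosen samples removed from the pool. For a kernel SVM the absolute decision score is a functional margin, not a geometric distance, so treating it as "nearest" is a simplification.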
Keywords/Search Tags: active learning, imbalanced data, SVM, iteration cost