Research On Imbalanced Data Classification Based On Voronoi Diagram

Posted on: 2018-09-11  Degree: Master  Type: Thesis
Country: China  Candidate: Y Wang  Full Text: PDF
GTID: 2348330533961383  Subject: Computer Science and Technology
Abstract/Summary:
Classification is an important research direction in data mining and has produced many results. These results work well when the classes are evenly distributed, but in practice most data sets are imbalanced. An imbalanced data set contains a very large number of samples from one category and only a small number from the others; the former is called the majority class and the latter the minority class. On imbalanced data sets, existing classification algorithms cannot accurately identify the category of each sample. Minority class samples also differ in where they lie relative to the decision boundary: the closer a sample is to the boundary, the more likely it is to be misclassified. Based on this observation, this thesis proposes novel algorithms built on the Voronoi diagram. Each minority class sample is given a weight according to its position relative to the decision boundary, the weights are computed by a fixed rule, and samples are then selected at random to synthesize new minority class samples. The classification of imbalanced data sets based on the Voronoi diagram mainly includes the following improvements:

1 A new method for boundary recognition. In imbalanced data sets, samples closer to the decision boundary are harder to classify, and traditional classification algorithms do not account for this difference. This thesis finds an approximate decision boundary separating the minority class and majority class samples by constructing the Voronoi diagram of the data set. For each minority class sample, the minimum distance to this boundary is calculated and used as its boundary degree.

2 Sampling strategy based on boundary degree. The boundary samples determine the new boundary. The boundary degree of each minority class sample is transformed by an exponential function with the natural constant as its base, and the function values are normalized to give the sampling probability of each minority class sample. Samples are then randomly selected for sampling according to these probabilities. These two steps together are called V-synth?.

3 Dealing with local imbalance. The algorithm uses the distance from a sample to the decision boundary as the weight for dividing the boundary samples, which is more flexible and accurate and can greatly reduce the chance of synthesizing noise samples. However, if only the sampling probability of the minority class samples is considered, without the distribution of the majority class, the data set may become balanced overall while remaining imbalanced locally. Therefore, the majority class samples are clustered, an influence factor describing the effect of each cluster's distribution density on each minority class sample is calculated, and the sampling probability is updated accordingly. This algorithm is called V-synth?.

Experiments were carried out on data sets constructed with special distributions and on commonly used UCI data sets. Processing the data sets with the two algorithms above gives better classification results.
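
The boundary-degree sampling step described in the abstract can be illustrated with a minimal Python sketch. It rests on two loudly stated assumptions, since the abstract gives no formulas: the distance to the nearest majority class sample stands in for the Voronoi-derived boundary degree, and the exponential transform is taken as exp(-d), so that samples nearer the boundary receive a higher probability. The function names are hypothetical and are not taken from the thesis.

```python
import math
import random


def boundary_degree(minority, majority):
    """Distance from each minority sample to its nearest majority sample.

    Assumption: this nearest-majority distance is a stand-in for the
    minimum distance to the Voronoi-based approximate decision boundary.
    """
    return [min(math.dist(p, q) for q in majority) for p in minority]


def sampling_probabilities(degrees):
    """Apply an exponential with base e to each degree, then normalize.

    Assumption: the exponent is -d, so a smaller boundary degree
    (closer to the boundary) yields a larger sampling probability.
    """
    weights = [math.exp(-d) for d in degrees]
    total = sum(weights)
    return [w / total for w in weights]


def pick_seeds(minority, probs, n, seed=0):
    """Randomly select n minority samples according to their probabilities."""
    rng = random.Random(seed)
    return rng.choices(minority, weights=probs, k=n)


# Toy two-dimensional data, made up purely for illustration.
minority = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
majority = [(4.0, 0.0), (4.0, 1.0), (5.0, 0.0)]

degrees = boundary_degree(minority, majority)
probs = sampling_probabilities(degrees)
seeds = pick_seeds(minority, probs, n=5)
```

In this toy data the minority sample at (3.0, 0.0) is closest to the majority class, so it receives the largest sampling probability. How the selected samples are turned into synthetic ones, and how the majority class clusters adjust the probabilities, is not specified in the abstract and is left out of the sketch.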
Keywords/Search Tags: imbalanced data sets, decision boundary, oversampling, Voronoi diagrams