Research on an Oversampling Technique for Imbalanced Data Based on Three-Way Decisions

Posted on: 2018-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2348330569986434
Subject: Computer Science and Technology
Abstract/Summary:
An imbalanced dataset is one in which the number of training samples differs markedly between classes. Most traditional classification algorithms perform poorly on imbalanced datasets: their results favor the majority class and achieve low accuracy on the minority class. Oversampling is an effective way to address imbalanced classification because it can increase accuracy on the minority class. However, it can also decrease accuracy on the majority class, and it easily synthesizes redundant data. In recent years, research on applications of three-way decision theory has made some progress, and applying the theory to imbalanced data processing may be an effective way to solve the classification problem. Inspired by three-way decision theory, this thesis studies sampling methods for imbalanced datasets based on three-way decisions. The main contents are as follows:

(1) Combining the neighborhood rough set model with the three-way decision model, an oversampling method for imbalanced datasets based on three-way decisions (TWD-IDOS) is proposed. First, the concepts of the neighborhood three-way decision model are defined and explained. Second, the neighborhood three-way decision model is used to divide the samples in the training set into three regions. Then, the minority class samples in the boundary region and in the negative region are oversampled separately. Finally, experiments on datasets from the UCI Machine Learning Repository compare the method with other approaches, including oversampling, undersampling, and ensemble learning methods. The experimental results show that the proposed method can effectively handle two-class classification on imbalanced datasets and outperforms other oversampling methods in the literature in terms of Recall, F-value, and AUC on the C4.5, KNN, and CART classifiers.

(2) Building on the Spark distributed parallel computing framework, a parallel oversampling method for imbalanced datasets based on three-way decisions is proposed. First, using transformations on Spark's RDDs, the training set is divided into three regions in parallel with the neighborhood three-way decision model. Second, the boundary region sampling method and the negative region sampling method of the TWD-IDOS algorithm are each parallelized. On this basis, the effectiveness and efficiency of the parallel algorithm are verified on UCI datasets and the KDDCUP-99 dataset using classification algorithms on the Weka platform. The experimental results indicate that the parallel algorithm remains effective while greatly reducing learning time. Finally, the running efficiency and parameter sensitivity are analyzed.
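
The region division and oversampling steps described in (1) can be illustrated with a minimal, single-process Python sketch. It is not the thesis's TWD-IDOS algorithm, whose exact definitions are not given in this abstract. The sketch assumes a Euclidean neighborhood of radius `delta`, a membership score equal to the share of a sample's neighbors that belong to the minority class, and thresholds `alpha` and `beta` for the three regions. All function names, parameter values, and the toy data are hypothetical. The thesis oversamples the boundary and negative regions with separate methods; for brevity the sketch applies the same SMOTE-style linear interpolation to both.

```python
import math
import random


def neighborhood(data, i, delta):
    """Indices of all other samples within Euclidean distance delta of sample i."""
    return [j for j, x in enumerate(data)
            if j != i and math.dist(data[i], x) <= delta]


def three_way_regions(data, labels, minority, delta, alpha, beta):
    """Split minority-class sample indices into POS / BND / NEG regions."""
    regions = {"POS": [], "BND": [], "NEG": []}
    for i, label in enumerate(labels):
        if label != minority:
            continue
        nbrs = neighborhood(data, i, delta)
        # Assumed membership score: share of neighbors that are also minority.
        # A sample with an empty neighborhood is scored 0.0 (and so lands in NEG).
        p = sum(labels[j] == minority for j in nbrs) / len(nbrs) if nbrs else 0.0
        if p >= alpha:
            regions["POS"].append(i)
        elif p <= beta:
            regions["NEG"].append(i)
        else:
            regions["BND"].append(i)
    return regions


def oversample(data, labels, minority, regions, per_sample, rng):
    """Synthesize points for BND and NEG minority samples by interpolation."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    synthetic = []
    for i in regions["BND"] + regions["NEG"]:
        partners = [j for j in minority_idx if j != i]
        if not partners:
            continue  # cannot interpolate with a single minority sample
        for _ in range(per_sample):
            j = rng.choice(partners)
            t = rng.random()
            # New point on the segment between sample i and minority sample j.
            synthetic.append(tuple(a + t * (b - a)
                                   for a, b in zip(data[i], data[j])))
    return synthetic


# Toy data (hypothetical): label 1 is the minority class.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0),
        (5.0, 6.0), (6.0, 5.0), (2.0, 0.0), (3.0, 0.0)]
labels = [1, 1, 1, 1, 0, 0, 1, 0]

regions = three_way_regions(data, labels, minority=1,
                            delta=1.5, alpha=0.7, beta=0.3)
print(regions)  # {'POS': [0, 1, 2], 'BND': [6], 'NEG': [3]}

new_points = oversample(data, labels, 1, regions, per_sample=2,
                        rng=random.Random(0))
print(len(new_points))  # 4
```

In this sketch each sample's region depends only on its own neighborhood, which is plausibly what makes the division step suitable for a per-record transformation in the Spark version described in (2). The sketch itself does not use Spark.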
Keywords/Search Tags: imbalanced data, three-way decision, oversampling, parallel, Spark