Research Of Imbalance Data Over-sampling Technique Based On Three-way Decisions

Posted on:2018-08-16

Degree:Master

Type:Thesis

Country:China

Candidate:L Wang

GTID:2348330569986434

Subject:Computer Science and Technology

Abstract/Summary:

An imbalanced dataset is one in which the training samples are unevenly distributed across classes. Most traditional classification algorithms perform poorly on such data: their results favor the majority class, and accuracy on the minority class is low. Oversampling is an effective way to address imbalanced classification because it raises minority-class accuracy. However, it can also lower majority-class accuracy and tends to synthesize redundant data. In recent years, research on applications of three-way decision theory has made progress, and applying the theory to imbalanced data may offer an effective solution to the classification problem. Inspired by three-way decision theory, this thesis studies sampling methods for imbalanced datasets based on three-way decisions. The main contributions are as follows:

(1) Combining the neighborhood rough set model with the three-way decision model, an oversampling method for imbalanced datasets based on three-way decisions (TWD-IDOS) is proposed. First, the concepts of the neighborhood three-way decision model are defined and explained. Second, the model is used to divide the training samples into three regions. Then, the minority-class samples in the boundary region and in the negative region are oversampled separately. Finally, experiments on datasets from the UCI machine learning repository compare the method with other approaches, including oversampling, undersampling, and ensemble learning methods. The experimental results show that the proposed method effectively handles two-class classification on imbalanced datasets and outperforms other oversampling methods from the literature on the Recall, F-value, and AUC metrics with the C4.5, KNN, and CART classifiers.

(2) Building on the Spark distributed parallel computing framework, a parallel oversampling method for imbalanced datasets based on three-way decisions is proposed. First, using transformations on Spark RDDs, the training set is divided in parallel into three regions with the neighborhood three-way decision model. Second, the boundary-region and negative-region sampling methods of the TWD-IDOS algorithm are each parallelized. On this basis, the effectiveness and efficiency of the parallel algorithm are verified on UCI datasets and the KDDCUP-99 dataset using classification algorithms on the Weka platform. The experimental results indicate that the parallel algorithm remains effective while greatly reducing learning time. Finally, the running efficiency and parameter sensitivity are analyzed.
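
The region-then-oversample idea in contribution (1) can be illustrated with a minimal Python sketch. It is not the TWD-IDOS algorithm itself, since the abstract does not give its definitions. Everything below is an assumption made for illustration: the function names `neighborhood_regions` and `oversample`, the neighborhood radius `delta`, the thresholds `alpha` and `beta`, the rule that assigns a region from the share of minority samples in a neighborhood, the per-region counts in `per_region`, and the SMOTE-style linear interpolation used to synthesize points.

```python
import math
import random


def neighborhood_regions(X, y, minority=1, delta=0.5, alpha=0.7, beta=0.3):
    """Label each sample 'POS', 'BND' or 'NEG' (assumed rule, not the thesis's).

    p is the share of minority-class samples inside the sample's
    delta-neighborhood; p >= alpha gives POS, p <= beta gives NEG,
    anything in between gives BND.
    """
    regions = []
    for xi in X:
        # The neighborhood includes the sample itself (distance 0).
        labels = [yj for xj, yj in zip(X, y) if math.dist(xi, xj) <= delta]
        p = sum(1 for label in labels if label == minority) / len(labels)
        if p >= alpha:
            regions.append("POS")
        elif p <= beta:
            regions.append("NEG")
        else:
            regions.append("BND")
    return regions


def oversample(X, y, regions, minority=1, per_region=None, seed=0):
    """Synthesize points only for minority samples in the BND and NEG regions.

    Each new point is interpolated between a minority sample and its nearest
    minority neighbor (SMOTE-style; a stand-in for the thesis's strategy).
    """
    per_region = per_region or {"BND": 1, "NEG": 2}  # hypothetical counts
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for i in minority_idx:
        count = per_region.get(regions[i], 0)
        others = [j for j in minority_idx if j != i]
        if count == 0 or not others:
            continue
        j = min(others, key=lambda k: math.dist(X[i], X[k]))
        for _ in range(count):
            t = rng.random()
            synthetic.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


# Toy data: a majority cluster with one minority sample inside it,
# a pure minority cluster, and a mixed minority/majority pair.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1),
     (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (2.0, 2.0), (2.2, 2.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

regions = neighborhood_regions(X, y)
synthetic = oversample(X, y, regions)
print(regions)
print(synthetic)
```

Minority samples in the positive region are left untouched in this sketch, which mirrors the abstract's point that only the boundary and negative regions are oversampled. A parallel variant along the lines of contribution (2) would apply the same per-sample region rule inside a distributed map step, which is not shown here.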

Keywords/Search Tags:

imbalanced data, three-way decision, oversampling, parallel, Spark

Related items

1	Research On An Improved Oversampling Method Of Unbalanced Data Set And Parallel Algorithm
2	Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling
3	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm
4	Research On Cover-based Algorithms For Oversampling On Imbalanced Data
5	User Complaint Prediction System Based On The KPI Dataset From IPTV Set-Top Box
6	Improved Methods Of Oversampling And Feature Selection Based On Imbalanced Data
7	Research Of Imbalanced Data Classification Method Based On Oversampling And Ensemble Learning
8	Research On Parallel Decision Tree Algorithm Based On Spark
9	Research On Oversampling Method For Multi-class Imbalanced Learning
10	Research On Imbalanced Dataset Classification Based On Oversampling Technique