Method Of Imbalanced Data Binary Classification Based On Neighborhood Three-way Decisions And Its Applications

Posted on: 2020-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: C L Yu
Full Text: PDF
GTID: 2370330590471770
Subject: Computer technology
Abstract/Summary:
Imbalanced data are data whose distribution between classes is uneven. Classical classification algorithms assume a balanced distribution between classes; applying them to imbalanced data leads to insufficient learning and poor classification of the minority class. Resampling can change the spatial distribution of the classes, reduce the imbalance ratio, and thereby mitigate the imbalanced-data problem. However, most resampling methods evaluate the data space poorly, so the distribution after sampling differs from the original one, which degrades classification performance and weakens generalization ability. The neighborhood model can properly measure the data space, and three-way decision theory is a method for solving complex problems; with both, resampling can rebalance the data space distribution under supervision and address binary classification of imbalanced data. This thesis therefore combines the neighborhood model and three-way decision theory to study binary classification of imbalanced data. The main work is as follows:

(1) For binary classification of imbalanced data, a novel algorithm based on neighborhood three-way decisions (NT-IDBC) is developed. First, the formulas and parameters for partitioning the data space are defined according to the neighborhood model and three-way decision theory. Second, the decision function divides the data space into regions. Regions in which majority and minority data are balanced are handled by oversampling; regions with many majority samples are handled by mixed sampling, which combines over-sampling and under-sampling. Finally, NT-IDBC is compared with multiple resampling algorithms and their ensemble learning methods on imbalanced data sets from the UCI database, evaluated by F-value and AUC. The experiments show that NT-IDBC achieves better classification performance on most data sets.

(2) To improve the speed of NT-IDBC on large amounts of data, the algorithm is further optimized with the parallel computing framework Spark, yielding the PNT-IDBC algorithm. First, Spark stores the data on multiple nodes of a distributed cluster, and the distributed data are partitioned in parallel. Second, the data in the different regions are mix-sampled in parallel. Finally, experiments verify the effectiveness and efficiency of the algorithm in terms of classification accuracy, running time, and speedup ratio.
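The abstract does not state the partitioning formulas, so the following is only a minimal sketch of how a neighborhood-based three-way partition could look, not the thesis's actual NT-IDBC definition. It assumes a Euclidean neighborhood of radius `delta`, a decision function equal to the share of minority samples in each neighborhood, and two thresholds `alpha` and `beta`; all of these names, the region labels, and the toy data are hypothetical.

```python
import math


def neighborhood(points, i, delta):
    """Indices of all points within Euclidean distance delta of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= delta]


def three_way_partition(points, labels, delta, alpha, beta, minority=1):
    """Assign each sample to 'POS', 'BND', or 'NEG' by the minority share of its neighborhood.

    Assumed rule (not taken from the thesis):
      share >= alpha      -> 'POS' (minority-dominated region)
      share <= beta       -> 'NEG' (majority-dominated region)
      otherwise           -> 'BND' (roughly balanced region)
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("expected 0 <= beta < alpha <= 1")
    regions = []
    for i in range(len(points)):
        nbrs = neighborhood(points, i, delta)  # never empty: contains i itself
        share = sum(labels[j] == minority for j in nbrs) / len(nbrs)
        if share >= alpha:
            regions.append("POS")
        elif share <= beta:
            regions.append("NEG")
        else:
            regions.append("BND")
    return regions


# Hypothetical toy data: label 1 is the minority class in a real setting;
# here the classes are small and equal only to keep the example readable.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster of majority samples
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),   # mixed cluster
    (5.0, 5.0), (5.1, 5.0),               # cluster of minority samples
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
regions = three_way_partition(points, labels, delta=0.2, alpha=0.7, beta=0.3)
print(regions)
```

Under these assumptions, a resampler could then oversample the minority class in the roughly balanced region and apply mixed over- and under-sampling in the majority-dominated region, in line with the strategy the abstract describes. How the thesis maps its own regions to these strategies in detail is not given here.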
Keywords/Search Tags: imbalanced data, binary classification, three-way decision theory, neighborhood model, parallel computing