
Research On Classification Algorithm For Imbalanced Data

Posted on: 2021-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhou
Full Text: PDF
GTID: 2428330602964570
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of information technology and the Internet, the amount of data in various fields is growing at an unprecedented rate. How to process data intelligently and extract the valuable information it contains has become a research and application hotspot in machine learning and data mining. Data classification, an important topic in data mining, has been widely used in data analysis and intelligent processing. Traditional classification methods achieve satisfactory results on balanced datasets. In practice, however, commonly used datasets are imbalanced, and traditional classification algorithms cannot guarantee good classification of minority-class samples on such data. This thesis studies classification algorithms for imbalanced datasets at the data level and the algorithm level:

(1) At the data level, a weighted bi-directional sampling method based on k-means for imbalanced datasets (WBSK) is proposed. The method first uses k-means to cluster the whole dataset. It then oversamples regions containing a large number of minority-class samples, assigning each cluster a different weight according to the imbalance ratio; this avoids generating noise and effectively overcomes both between-class and within-class imbalance. Finally, it undersamples the clusters with a large number of minority class samples to balance the sample counts of the whole dataset. Experimental results on 11 datasets show that the proposed method is superior to other methods under different classifiers and evaluation criteria.

(2) At the algorithm level, a fixed-radius nearest neighbor progressive competition algorithm (FRNNPC) is proposed. As a preprocessing step, FRNNPC eliminates ineligible samples globally through the fixed-radius nearest neighbor rule. It then applies NPC to the resulting candidate data, progressively calculating the scores of the query sample's nearest neighbors until the sum of the scores of one class is higher than that of the other class. In short, this method deals effectively with the imbalance problem and does not require any manually set parameters. Experiments compare the proposed method with other representative algorithms on 10 imbalanced datasets and illustrate its effectiveness.
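
The cluster-weighted oversampling idea in (1) can be illustrated with a minimal Python sketch. It is not the WBSK implementation: the abstract does not give the weighting formula or the sample generator, so the sketch assumes cluster assignments already exist (for example from k-means), assumes each cluster's weight is proportional to its minority count, and assumes new samples are produced by linear interpolation between two minority points in the same cluster. The function names allocate_oversampling and oversample_cluster are hypothetical.

```python
import random
from collections import defaultdict


def allocate_oversampling(cluster_ids, labels, minority=1):
    """Split the oversampling budget across clusters.

    Assumption (not from the thesis): a cluster's share of the budget is
    proportional to the number of minority samples it contains.
    """
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    if n_min == 0:
        return {}
    # Number of synthetic minority samples needed to match the majority count.
    budget = max(n_maj - n_min, 0)

    minority_per_cluster = defaultdict(int)
    for c, y in zip(cluster_ids, labels):
        if y == minority:
            minority_per_cluster[c] += 1

    # Integer division rounds down, so the quotas may sum to slightly
    # less than the budget.
    return {c: budget * k // n_min for c, k in minority_per_cluster.items()}


def oversample_cluster(points, n_new, rng):
    """Create n_new synthetic points inside one cluster.

    Each new point lies on the segment between two distinct minority
    points drawn from the same cluster (an assumed, SMOTE-like generator).
    """
    if len(points) < 2:
        return []  # interpolation needs at least two points
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(points, 2)
        t = rng.random()
        new_points.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return new_points
```

Calling allocate_oversampling with the cluster ids and class labels of a dataset yields a per-cluster quota, which can then be passed to oversample_cluster together with that cluster's minority points and a random.Random instance. Keeping generation inside a cluster is what confines synthetic samples to regions that already contain minority data.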
Keywords/Search Tags: oversampling, undersampling, clustering, imbalanced dataset, nearest neighbor rule