
Research On Imbalanced Dataset Classification Algorithm Based On Sampling

Posted on: 2022-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: M F Lu
Full Text: PDF
GTID: 2518306602465844
Subject: Applied Mathematics
Abstract/Summary:
With the advent of the big data era, a large amount of data has emerged. Processing and classifying these data, and extracting useful information from them, has become increasingly important. Many existing classification algorithms can classify data quickly and effectively, but most of them are designed for balanced data sets. In practical applications, however, data sets often suffer from class imbalance. When traditional classification algorithms are applied to such imbalanced data sets, the classifier favors the majority instances and ignores the minority instances, even though the minority class is often the more important one. To address the classification of imbalanced data, this thesis studies the problem from the perspective of data sampling based on the radial basis function. The main contributions are as follows.

Most sampling methods consider only the distribution of the minority instances and ignore the distribution of the majority instances during sampling. When the data set contains noise, outliers, or disjoint data distributions, class overlap inevitably occurs and degrades classification performance. To solve this problem, this thesis proposes a hybrid sampling method based on the radial basis function. Using the radial basis function allows the method to take the distribution information of all instances into account. At the same time, the synthetic instances do not change the original distribution of the data set, and they neither blur the decision boundary nor introduce overlap. One task in the sampling process is selecting the sampling area. The algorithm in this thesis selects the sampling area with the radial basis function, which is used to compute the mutual class potential. Synthetic instances generated at points with a smaller mutual minority-class potential tend to lie in the region of the original minority instances, which satisfies the rule of the data set that similar samples should be close to each other. Majority instances with a greater mutual potential carry more useful information, so sampling in the majority class is carried out in the region with greater mutual potential. The algorithm is applied to imbalanced data sets, and the experimental results show its effectiveness.

To handle between-class imbalance and within-class imbalance simultaneously, and to ensure that the algorithm can be applied to data sets of different shapes and sizes, this thesis also proposes an adaptive oversampling method based on clustering. First, an improved density peak clustering algorithm, which determines the cluster centers adaptively, is used to cluster the minority instances. Then, the local densities are used to adaptively determine the number of instances to oversample. Finally, the radial basis function is used to calculate the mutual class potential of the minority instances, so that the instances of each minority sub-cluster can be oversampled separately. The algorithm is combined with different classifiers and applied to imbalanced data sets. The experimental results show that it processes imbalanced data more effectively than the other algorithms it was compared with.
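
The abstract does not give the formula for the mutual class potential, so the following is only a minimal sketch under stated assumptions: the potential of a class at a point is taken to be the sum of Gaussian radial basis functions centered on that class's instances, and the mutual class potential is taken to be the majority potential minus the minority potential. The function names, the spread parameter `gamma`, the sign convention, and the toy data are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def rbf_potential(point, instances, gamma=1.0):
    """Sum of Gaussian RBF contributions of `instances` evaluated at `point`.

    `gamma` is an assumed spread parameter; larger values make each
    instance influence a wider neighbourhood.
    """
    point = np.asarray(point, dtype=float)
    instances = np.asarray(instances, dtype=float)
    distances = np.linalg.norm(instances - point, axis=1)
    return float(np.sum(np.exp(-(distances / gamma) ** 2)))


def mutual_class_potential(point, majority, minority, gamma=1.0):
    """Assumed definition: majority potential minus minority potential.

    Under this sign convention, negative values indicate a region
    dominated by minority instances, positive values a region
    dominated by majority instances.
    """
    return (rbf_potential(point, majority, gamma)
            - rbf_potential(point, minority, gamma))


# Toy two-dimensional data, invented purely for illustration.
majority = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.2], [-0.1, 0.1]])
minority = np.array([[2.0, 2.0], [2.1, 1.9]])

near_minority = mutual_class_potential([2.0, 1.9], majority, minority)
near_majority = mutual_class_potential([0.0, 0.1], majority, minority)
print(near_minority, near_majority)
```

With this convention, a candidate point close to the minority cluster receives a negative value and a point inside the majority cluster receives a positive one, so a sampler could rank candidate locations by this score when choosing where to place synthetic instances or which majority instances to sample.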
Keywords/Search Tags: Imbalanced data, Hybrid sampling, Gaussian radial basis function, Synthetic minority oversampling technique