
Data Distribution-driven Adaptive Hybrid Sampling Method For Imbalanced Data Processing

Posted on: 2022-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: J M Zhou
Full Text: PDF
GTID: 2518306731453514
Subject: Software engineering

Abstract/Summary:
In practical data classification tasks, the class distribution is often uneven, i.e., the number of samples in one category far exceeds that in another. If a classifier learns directly from such unbalanced data, its results will be biased towards the majority class. The study of unbalanced data processing methods has therefore received extensive attention at home and abroad. These methods can be roughly divided into algorithm-layer, feature-layer, and data-layer methods. Data-layer methods balance the data by adjusting the number of samples and are widely applied. In traditional sampling methods, however, the interpolation is usually blind and cannot keep the data distribution stable. This matters because the spatial distribution of the samples is closely related to the accuracy of classification models: undersampling or oversampling unbalanced data without preserving the original distribution characteristics of the samples causes problems such as adding noise points and destroying reasonable classification boundaries.

In this thesis, two hybrid sampling methods based on the characteristics of the data distribution are proposed, one for high-dimensional imbalanced data sets and one for imbalanced data sets with arbitrarily complex and sparse distributions. By modeling the spatial distribution characteristics of the learning samples and performing constrained sampling of the imbalanced samples based on the resulting distribution models, the unbalanced data can be processed while maintaining the distribution characteristics of the samples, leading to an effective improvement in classifier performance. The main contributions are summarized as follows:

(1) A spectral clustering-based adaptive synthetic sampling (SCbADASYN) method is proposed. By introducing spectral clustering, the internal structure of the minority samples (the structural characteristics of their clusters) is taken into account, and the minority class is oversampled adaptively to obtain relatively balanced data consistent with the distribution characteristics of the samples. Taking advantage of the dimension reduction in spectral clustering, SCbADASYN can perform adaptive sample interpolation on high-dimensional imbalanced data sets while keeping their spatial distribution characteristics unchanged. SCbADASYN can thus effectively address the classification bias caused by the imbalanced distribution of high-dimensional data and consequently improve the accuracy of traditional classifiers.

(2) A variational Bayesian Gaussian mixture model-based adaptive synthetic sampling method (VBGMM-Sampling) is proposed. Variational inference is introduced to learn an optimal Gaussian mixture model in advance. VBGMM-Sampling can effectively capture the spatial distribution characteristics of minority samples under any unknown distribution, and it removes the need, present in traditional clustering methods, to assume the number of clusters manually. Based on the learned Gaussian mixture model, VBGMM-Sampling adaptively oversamples the minority samples and applies the Tomek-link operation to further balance and clean the samples. In theory, the proposed method can process unbalanced data sets with arbitrarily complex distribution characteristics to obtain relatively balanced samples, keeping their distribution characteristics unchanged and not destroying the classification boundary between the majority and minority classes, thereby improving classifier performance.

(3) Extensive experiments were carried out on numerical simulation data, UCI public data, real network intrusion data (NSL-KDD and KDD99), and credit card fraud data. Numerical simulation results show that the data-layer method can effectively maintain the data distribution and improve classification accuracy. Experimental results on the NSL-KDD and KDD99 data sets show that SCbADASYN can significantly improve the classification performance of traditional classifier models on unbalanced data sets. Credit card fraud detection experiments show that combining VBGMM-Sampling with traditional classifiers achieves excellent recognition performance on extremely unbalanced credit card fraud data, indicating that the proposed method has potential for wide application.
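
The two building blocks named in the abstract, adaptive synthetic oversampling and Tomek-link cleaning, can be illustrated with a small sketch. This is not the thesis's SCbADASYN or VBGMM-Sampling implementation: the spectral clustering and variational Gaussian mixture steps are omitted, and what remains is a plain ADASYN-style allocation (minority samples with more majority-class neighbours receive more synthetic points) followed by Tomek-link detection. The function names, the choice k=3, and the toy data are all assumptions made for illustration.

```python
import math
import random


def knn(point, candidates, k):
    """Return the k candidates nearest to `point` (Euclidean), excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point[0], c[0]))[:k]


def adasyn_counts(samples, minority_label, n_new, k=3):
    """Split roughly n_new synthetic points across minority samples by local difficulty."""
    minority = [s for s in samples if s[1] == minority_label]
    # Difficulty of a sample = share of majority-class points among its k nearest neighbours.
    ratios = [
        sum(1 for nb in knn(s, samples, k) if nb[1] != minority_label) / k
        for s in minority
    ]
    total = sum(ratios)
    if total == 0:  # no minority sample borders the majority class: spread evenly
        weights = [1 / len(minority)] * len(minority)
    else:
        weights = [r / total for r in ratios]
    # Rounding means the counts may not sum exactly to n_new.
    return minority, [round(w * n_new) for w in weights]


def oversample(samples, minority_label, n_new, k=3, seed=0):
    """Create synthetic minority points by interpolating towards minority neighbours."""
    rng = random.Random(seed)
    minority, counts = adasyn_counts(samples, minority_label, n_new, k)
    synthetic = []
    for s, count in zip(minority, counts):
        neighbours = knn(s, minority, k)
        if not neighbours:  # a lone minority sample has nothing to interpolate with
            continue
        for _ in range(count):
            nb = rng.choice(neighbours)
            lam = rng.random()  # position on the segment from s to nb
            x = tuple(a + lam * (b - a) for a, b in zip(s[0], nb[0]))
            synthetic.append((x, minority_label))
    return synthetic


def tomek_links(samples):
    """Return opposite-class pairs that are each other's nearest neighbour."""
    links = []
    for i, s in enumerate(samples):
        nn = knn(s, samples, 1)[0]
        j = next(idx for idx, t in enumerate(samples) if t is nn)
        if i < j and s[1] != nn[1] and knn(nn, samples, 1)[0] is s:
            links.append((s, nn))
    return links


# Toy data (hypothetical): label 0 is the majority class, label 1 the minority class.
majority = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
            ((0.3, 0.2), 0), ((0.2, 0.4), 0), ((1.0, 1.0), 0)]
minority = [((1.1, 1.0), 1), ((2.0, 2.0), 1), ((2.1, 2.2), 1), ((2.2, 2.0), 1)]
data = majority + minority

_, counts = adasyn_counts(data, minority_label=1, n_new=2)
synthetic = oversample(data, minority_label=1, n_new=2)
links = tomek_links(data + synthetic)  # candidate pairs to clean after oversampling
print(counts, synthetic, links)
```

In this toy set, only the minority sample at (1.1, 1.0) sits next to majority points, so it receives the whole synthetic budget. A distribution-aware method such as those described above would additionally restrict interpolation to samples within the same cluster or mixture component.
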
Keywords/Search Tags: Imbalanced data, Spectral clustering, Gaussian mixture model, Variational inference, Adaptive sampling method