Research On Oversampling Method For Class Imbalanced Data

Posted on: 2023-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Xie
GTID: 2568306614972629
Subject: Computer technology
Abstract/Summary:
In today's big data era, class imbalance arises in data mining, machine learning, image processing, text processing and other fields. Class imbalance generally refers to an uneven distribution of the number of samples across categories. Machine learning classifiers work well on ordinary balanced data sets, but on imbalanced data they are affected by class overlap, small sample size, fuzzy class boundaries and small disjuncts, which hinder the training of the classifier and lead to poor performance.

Current methods for handling class imbalance fall into three categories: data-level methods, algorithm-level methods and ensemble methods. This thesis focuses on oversampling, a data-level approach that works on the data itself. Data-level methods can be divided into oversampling, undersampling and hybrid sampling. Oversampling changes the amount of data by synthesizing samples so that the two classes become balanced, and the resulting balanced data can then be used to train a variety of classifiers. Because they act on the data directly, data-level methods address the imbalance at its source, and they have been widely studied and applied.

Many oversampling techniques exist and have achieved a certain degree of success. However, most of the popular methods amount to simple random oversampling. The sample points they generate are poorly representative, the sampling rate cannot be determined well, and the spatial distribution of the samples is not adequately considered, so the quality of the final sample points is poor. To address these problems, the work of this thesis includes the following:

(1) An EM clustering oversampling algorithm for class-imbalanced data, OEMC (oversampling based on EM clustering), is proposed. First, the algorithm applies clustering, measures the similarity between samples by Euclidean distance, and selects the center point of each cluster as an oversampling point, which to some extent solves the problem of generated samples being insufficiently important. Second, by sampling directly in the minority-class sample space, it better addresses the fact that SMOTE, Cluster-SMOTE and similar methods do not target the cluster space. Third, by oversampling 30% of the number of minority-class samples, it avoids two problems: cluster-based undersampling blindly pursuing equal sample counts in the two classes, and SMOTE and similar algorithms not specifying a sampling rate.

(2) An oversampling algorithm based on multi-level residual clustering and inter-layer weighted sample selection, MRC & IWSS, is proposed. First, MRC & IWSS adopts a multi-layer clustering structure, which solves the problem that current oversampling methods cannot reuse the information produced by oversampling, and improves the quality of the synthetic samples. Second, a residual structure connects the layers so that the sample distribution stays consistent between layers, which largely avoids the degradation of oversampling performance as the number of layers increases. Finally, the algorithm uses an inter-layer weighted sample selection mechanism to alleviate the problem of boundary ambiguity to some extent.

(3) To verify the effectiveness of the proposed OEMC and MRC & IWSS algorithms, a variety of existing data-level resampling methods are selected for comparison, and several evaluation metrics for imbalanced problems are used to compare the classification performance of traditional classifiers on the imbalanced data processed by each algorithm. The results demonstrate the effectiveness of the two data-level oversampling algorithms.
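The OEMC idea in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's implementation. The abstract does not say how the number of clusters is chosen, what covariance model the EM step uses, or how cluster centers become samples. The code below assumes a spherical Gaussian mixture fitted by EM on the minority class, with the number of components set to 30% of the minority count, and it adds each component mean as one synthetic sample. The names `em_cluster_centers` and `oversample_with_centers`, the iteration count and the variance floor are all hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def em_cluster_centers(points, k, iters=50, seed=0):
    """Fit a k-component spherical Gaussian mixture by EM; return the means."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]  # start from k data points
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point (log-space).
        resp = []
        for p in points:
            logs = [
                math.log(weights[j])
                - 0.5 * dim * math.log(2 * math.pi * variances[j])
                - sq_dist(p, means[j]) / (2 * variances[j])
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component lost all its points; keep old parameters
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                for d in range(dim)
            ]
            spread = sum(r[j] * sq_dist(p, means[j]) for r, p in zip(resp, points))
            variances[j] = max(spread / (nj * dim), 1e-6)  # floor avoids collapse
    return [tuple(m) for m in means]


def oversample_with_centers(minority, rate=0.3, seed=0):
    """Add one synthetic sample per cluster center (assumed reading of OEMC)."""
    k = max(1, round(rate * len(minority)))  # 30% of the minority count
    centers = em_cluster_centers(minority, k, seed=seed)
    return minority + centers, centers


# Toy minority class: two small blobs of 10 points each (made-up data).
rng = random.Random(42)
minority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]
minority += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(10)]

augmented, centers = oversample_with_centers(minority)
print(len(minority), len(centers), len(augmented))
```

Because each center is a responsibility-weighted average of real minority samples, every synthetic point lies inside the region already occupied by the minority class. This matches the abstract's point about sampling directly in the minority sample space, and the fixed `rate` makes the sampling rate explicit instead of forcing the two classes to equal size.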
Keywords/Search Tags: Unbalanced data, Data level, Oversampling, Residual clustering, Classification