Research On Cover-based Algorithms For Oversampling On Imbalanced Data

Posted on: 2022-08-04 | Degree: Master | Type: Thesis
Country: China | Candidate: G L Tian | Full Text: PDF
GTID: 2518306323991089 | Subject: Computer technology
Abstract/Summary:
Classification has long been a central topic in machine learning and data mining. In many practical applications, however, real-world datasets are imbalanced, and traditional classification models may then produce biased classifications that incur a high cost. To address this issue, a variety of techniques for improving classification performance on imbalanced data have been studied. Sampling is a simple and effective approach. Although a large number of sampling algorithms have been proposed, most of them explore only the relationships between individual samples; few study the relationships between sample clusters, so they may fail to capture the overall distribution of a dataset.

The main factors that influence classification performance on imbalanced data include the imbalance ratio, the size of the overlapping areas, the severity of intra-class sub-aggregation, and the proportion of noisy samples. Class overlap degrades classification results considerably, especially on datasets with a high imbalance ratio. Most existing studies address only one of these aspects, and few methods simultaneously target datasets with both high imbalance and high overlap. To solve these problems, this thesis explores the neighborhood relationships between samples and covers based on the Constructive Covering Algorithm (CCA) and proposes two new oversampling methods for classifying imbalanced data. The main contributions are as follows:

(1) An oversampling method based on cover structure, Cover-SMOTE, is proposed. CCA is used to construct covers by exploring the relationships between samples, and the relationships between covers are then extracted with the K-Nearest Neighbor algorithm. The distribution of the dataset is studied in this way to identify the key minority class samples used for generating synthetic tuples. SMOTE is then applied to these key minority samples to obtain balanced datasets. To verify the effectiveness of this method, extensive experiments compare it with the similar algorithms CTD and CTDE on 8 KEEL datasets under the same experimental conditions, and the results show that it outperforms both. The superiority of Cover-SMOTE is further demonstrated by comparison with 4 other oversampling methods on several KEEL datasets.

(2) A new method, quantum potential energy based cover optimization oversampling (QPCOO), is proposed for handling high imbalance together with high overlap. The algorithm optimizes sampling in the overlapping area, adjusting the decision boundary of the classifier by oversampling the minority class samples. It constructs better covers based on the ranked quantum potential energy of the samples, so as to extract the minority covers in the class overlapping area in a new granularity space, and it oversamples only on these extracted minority covers to balance the data. Comprehensive experiments on multiple public KEEL datasets have been conducted to evaluate the approach. The improvement in classification performance for several traditional algorithms (KNN, SVM and NB) validates the effectiveness of QPCOO as a data preprocessing step. Additionally, comparison with state-of-the-art oversampling methods demonstrates the superiority of the proposed algorithm and supports the rationality and feasibility of its main ideas.
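
The SMOTE step that both contributions rely on can be illustrated with a minimal sketch. It is not the thesis's implementation: it shows only standard SMOTE-style interpolation between a seed minority sample and one of its nearest minority neighbors. The function name smote_on_selected and the parameter selected_idx are hypothetical; in particular, how the key minority samples are selected (through CCA covers or quantum potential energy) is not reproduced here and is simply assumed to be given as a list of indices.

```python
import numpy as np

def smote_on_selected(X_min, selected_idx, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    X_min        : (n, d) array holding all minority-class samples.
    selected_idx : indices of the seed ("key") minority samples. How these are
                   chosen is NOT shown here; it is an assumed, hypothetical input.
    k            : number of nearest minority neighbors considered per seed.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    selected_idx = np.asarray(selected_idx)
    k = min(k, len(X_min) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    # Distances from every seed sample to every minority sample.
    diffs = X_min[selected_idx][:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # A seed must not be its own neighbor.
    dists[np.arange(len(selected_idx)), selected_idx] = np.inf
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        s = rng.integers(len(selected_idx))         # pick a seed at random
        base = X_min[selected_idx[s]]
        nb = X_min[neighbors[s, rng.integers(k)]]   # pick one of its k neighbors
        gap = rng.random()                          # interpolation factor in [0, 1)
        synthetic[i] = base + gap * (nb - base)     # point on the segment base -> nb
    return synthetic
```

Because every synthetic point lies on a line segment between two existing minority samples, restricting the seeds to a chosen subset concentrates the new samples around that subset, which is the general idea behind oversampling only on selected minority samples.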
Keywords/Search Tags: Imbalanced data, Cover, Oversampling, Quantum potential energy, Classification