Research On Imbalanced Dataset Classification Algorithm Based On Ensemble Learning

Posted on: 2021-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: D Chen
Full Text: PDF
GTID: 2428330602996947
Subject: Computer Science and Technology
Abstract/Summary:
Classification is one of the basic research topics in data mining and machine learning. It has a wide range of applications in fields such as business transactions, financial markets, telecommunications services, data analysis, and scientific research. Traditional classification algorithms assume that the data are balanced and optimize accuracy over the entire dataset. In practice, many datasets used for classification tasks are imbalanced, for example in software defect detection, credit card fraud prediction, medical disease diagnosis, and image retrieval. When existing methods learn from imbalanced data, the classification model biases its predictions toward the majority class and ignores the minority class. In real applications, the accuracy of detecting the minority class is vital, because misclassifying a minority instance is more costly than misclassifying a majority instance. It is therefore of great significance to study how to improve the performance of classifiers on imbalanced data. Among the algorithms currently proposed for imbalanced data classification, ensemble learning has attracted much attention because it combines multiple weak classifiers to obtain good generalization performance. However, on highly imbalanced data and on complex imbalanced data, its generalization ability is still weak. This paper therefore focuses on the use of ensemble learning for the classification of imbalanced data. The main research content and innovative work of the paper are as follows:

(1) For highly imbalanced datasets, this paper proposes the Distance-based Balancing Ensemble model with a Distance-based Combination Rule (DBE-DCR) and applies it to the classification of highly imbalanced datasets. DBE-DCR builds on the DBE model. First, the highly imbalanced dataset is divided into multiple subsets with low imbalance, and oversampling is applied within each subset to ensure that each subset contains enough learning instances. Finally, the outputs of the DBE model are combined by the DCR, which considers the relationship between the query instance and the learning instances to adjust the output of the DBE model and obtain better generalization ability. Experiments are performed on 48 imbalanced datasets collected from the KEEL public data repository. They show that DBE-DCR performs comparably to, or even better than, the current best methods.

(2) For complex imbalanced datasets, this paper proposes the Dynamic Ensemble Selection Decision-making (DESD) method based on ensemble learning. Current methods for imbalanced dataset classification do not take into account complex data problems such as class overlap, and some of them even exacerbate these problems when they attempt to address them. To solve this, this paper proposes a dynamic selection ensemble algorithm. First, DESD uses a repeated random splitting technique to divide the dataset into multiple balanced subsets that contain little or no class overlap. Then, a selection criterion that combines the overall accuracy and the minority class accuracy is used to select the strongest classifiers to participate in the final ensemble. The proposed method is also tested and compared on the imbalanced datasets collected from KEEL. Experiments show that the generalization ability of the proposed DESD algorithm is superior to that of state-of-the-art methods.
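The abstract gives no implementation details, so the following Python sketch only illustrates the two ideas named for DESD: repeated random splitting into balanced subsets, and a selection criterion that mixes overall accuracy with minority class accuracy. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the function names, the weight `alpha`, the nearest-centroid base classifier, the tie-breaking rule, and the use of a plain majority vote.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (features, label); label 1 = minority class, 0 = majority class (assumed encoding)
Sample = Tuple[Sequence[float], int]
Classifier = Callable[[Sequence[float]], int]


def balanced_subsets(data: List[Sample], n_rounds: int, seed: int = 0) -> List[List[Sample]]:
    """Repeatedly shuffle the majority class and cut it into chunks of minority size.

    Each chunk is paired with all minority samples, giving one balanced subset.
    """
    rng = random.Random(seed)
    minority = [s for s in data if s[1] == 1]
    majority = [s for s in data if s[1] == 0]
    size = len(minority)
    subsets = []
    for _ in range(n_rounds):
        shuffled = majority[:]
        rng.shuffle(shuffled)
        # Leftover majority samples that do not fill a whole chunk are skipped.
        for start in range(0, len(shuffled) - size + 1, size):
            subsets.append(shuffled[start:start + size] + minority)
    return subsets


def train_centroid(subset: List[Sample]) -> Classifier:
    """Hypothetical base learner: predict the class with the nearer centroid."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0 = mean([x for x, y in subset if y == 0])
    c1 = mean([x for x, y in subset if y == 1])

    def predict(x: Sequence[float]) -> int:
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0

    return predict


def selection_score(clf: Classifier, validation: List[Sample], alpha: float = 0.5) -> float:
    """alpha * overall accuracy + (1 - alpha) * minority class accuracy (assumed form)."""
    overall = sum(clf(x) == y for x, y in validation) / len(validation)
    minority = [(x, y) for x, y in validation if y == 1]
    minority_acc = sum(clf(x) == y for x, y in minority) / len(minority) if minority else 0.0
    return alpha * overall + (1 - alpha) * minority_acc


def select_top(clfs: List[Classifier], validation: List[Sample], k: int) -> List[Classifier]:
    """Keep the k classifiers with the highest selection score."""
    return sorted(clfs, key=lambda c: selection_score(c, validation), reverse=True)[:k]


def majority_vote(clfs: List[Classifier], x: Sequence[float]) -> int:
    """Combine the selected classifiers; ties go to the minority class (assumption)."""
    votes = sum(c(x) for c in clfs)
    return 1 if votes * 2 >= len(clfs) else 0
```

In this sketch a classifier that always predicts the majority class scores well on overall accuracy but gets no credit from the minority term, so it ranks below any classifier that actually detects minority instances. That is the motivation the abstract gives for combining the two rates.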
Keywords/Search Tags: Imbalanced Dataset, Ensemble Learning, Classification Algorithm