
A Non-sampling AdaBoost With Information Entropy For Imbalanced Learning

Posted on: 2022-02-23    Degree: Master    Type: Thesis
Country: China    Candidate: Y Xie    Full Text: PDF
GTID: 2518306509979559    Subject: Applied Mathematics
Abstract/Summary:
In today's information-globalized society, big data is an important carrier of diverse information, and deeply mining the valuable information it contains is a hot research topic. Against this background, imbalanced data can be found everywhere, arising in fields such as medical diagnosis and financial risk prevention. Because of the imbalanced nature of such data, traditional classification methods cannot achieve good results, and a variety of learning methods for imbalanced data have therefore emerged. At present, sampling is a popular way to improve classification performance on imbalanced data. However, adding generated samples often changes the distribution of the raw data and makes the decision boundary more or less unreliable. To avoid changing the distribution of the dataset, this thesis proposes a Non-Sampling AdaBoost with Information Entropy for imbalanced learning (NSAIE).

First, the proposed method applies no sampling technique and keeps the distribution of the raw imbalanced dataset intact. Starting from feature selection, information entropy and mutual information are introduced to measure the importance of features, and the features with high information entropy values are selected. Each selected feature is weakly correlated with the other features and strongly correlated with the class label; such a feature is called an entropy-feature. Specifically, inspired by mutual information theory, we design a heuristic entropy-feature criterion that effectively maps the data from a high-dimensional feature space to a low-dimensional one.

Second, the algorithm is improved under the AdaBoost framework. To ensure that important minority samples are correctly classified, we set an adaptive weight for each base classifier in the framework. The ensemble learning model approaches the imbalanced decision boundary step by step and stops the learning process when the evaluation metrics are high enough or the maximum number of iterations is reached. In this way, the loss on minority samples is minimized while the maximum number of majority samples can be removed.

Finally, we conduct comparative experiments on 12 datasets, where visualizing the NSAIE algorithm on some of the datasets helps to better understand the proposed classification idea. The experimental results show that the proposed algorithm is superior to 17 other common methods, achieving 88.72%, 79.41%, 80.39%, and 79.41% respectively on four evaluation metrics. These scores reflect the superiority of the proposed method. In addition, when the NSAIE algorithm is applied to a multi-class dataset, all metrics achieve top-1 performance, indicating that the method has practical significance for solving multi-class classification problems. To verify the differences between the NSAIE algorithm and the other methods, a t-test with a significance level of 0.05 is carried out. The results show that the NSAIE algorithm has significant advantages over the other imbalanced learning methods.
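The entropy-feature selection described above follows a max-relevance, min-redundancy idea: keep features that share high mutual information with the class label and low mutual information with the other features. For reference, the mutual information between a feature X and label Y is I(X;Y) = H(Y) - H(Y|X) = sum over x,y of p(x,y) log[ p(x,y) / (p(x)p(y)) ]. The abstract does not give the exact scoring rule, so the following is only a minimal Python sketch of such a criterion using scikit-learn's mutual information estimators; the function name entropy_feature_scores and the relevance-minus-redundancy score are illustrative assumptions, not the thesis's formula.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def entropy_feature_scores(X, y, k=10):
    """Rank features by relevance to the label minus redundancy with other features."""
    # Relevance: estimated mutual information between each feature and the class label.
    relevance = mutual_info_classif(X, y, random_state=0)
    n_features = X.shape[1]
    # Redundancy: average pairwise mutual information with the remaining features.
    # mutual_info_score expects discrete values, so continuous features are assumed
    # to have been discretized (e.g. binned) beforehand.
    redundancy = np.array([
        np.mean([mutual_info_score(X[:, i], X[:, j])
                 for j in range(n_features) if j != i])
        for i in range(n_features)
    ])
    # High score = strong correlation with the label, weak correlation with other features.
    score = relevance - redundancy
    return np.argsort(score)[::-1][:k]   # indices of the k best entropy-features

Slicing the data as X[:, entropy_feature_scores(X, y, k)] then gives the low-dimensional representation that the abstract refers to.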
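The abstract specifies neither the exact adaptive weighting of the base classifiers nor the stopping metric, so the following is only a hedged sketch of a standard AdaBoost training loop with the early-stopping behaviour described above, assuming binary labels in {-1, +1} with +1 as the minority class; the target_f1 threshold and the F1-based stopping rule are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def boost_with_early_stop(X, y, n_rounds=50, target_f1=0.90):
    n = len(y)
    w = np.full(n, 1.0 / n)                     # uniform initial sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # classical AdaBoost learner weight
        # Misclassified samples gain weight, so later rounds concentrate on the
        # hard examples near the imbalanced boundary (often minority samples).
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
        # Stop early once the ensemble is good enough on the minority class,
        # mirroring the "stop when the metrics are high enough" rule above.
        margin = sum(a * l.predict(X) for a, l in zip(alphas, learners))
        if f1_score(y, np.where(margin >= 0, 1, -1), pos_label=1) >= target_f1:
            break
    return learners, alphas

The early stop is what lets the ensemble approach the decision boundary step by step without over-fitting the majority class in later rounds.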
Keywords/Search Tags: Imbalanced Learning, Information Entropy, Feature Selection, Ensemble Learning