Research On Adaboost Improved Algorithm For Unbalanced Data

Posted on: 2022-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: J R Yan
Full Text: PDF
GTID: 2518306509465364
Subject: Software engineering
Abstract/Summary:
Classification is an important branch of data mining. Common classification models usually assume that the classes in a data set contain roughly the same number of samples and that every misclassification has the same cost. However, training a traditional classifier on an unbalanced data set leads to low prediction accuracy on the minority classes, so the classification of unbalanced data has long been a hot research topic in machine learning. This thesis studies classification methods for unbalanced data. It introduces an undersampling method based on sample weights, a method for calculating the local density of samples, and a method for calculating the misclassification cost of samples, and it proposes three improved AdaBoost algorithms for unbalanced data. The main work of this thesis is as follows:

(1) USCBoost (Undersampling and Cost-sensitive Boosting), an undersampling and cost-sensitive classification algorithm for unbalanced data, is proposed. The algorithm undersamples the majority-class samples and introduces a cost matrix into the weight update formula, so that the weights of misclassified minority-class samples increase faster. Experimental results show that, compared with other algorithms, USCBoost significantly improves the F1-measure and G-mean values, and that the proposed algorithm is feasible for classifying unbalanced data.

(2) An AdaBoost algorithm based on sample density is proposed. The algorithm calculates the local density of each sample from its k nearest neighbors. The local densities of the two classes are normalized separately to give each sample a weight, and these weights are used as the initial sample weights in AdaBoost. Experiments show that the proposed algorithm improves the identification of minority-class samples.

(3) An AdaCost algorithm based on isolation forest is proposed. The algorithm uses an isolation forest to obtain an anomaly score for each sample and then calculates the misclassification cost of each sample from its anomaly score. The misclassification costs of the two classes are calculated separately and then normalized so that the costs within each class sum to 1, which effectively distinguishes in-class samples from inter-class samples and reduces the impact of noisy data.

(4) An unbalanced data classification system based on ensemble learning is designed and implemented. The system integrates multiple ensemble classification algorithms and base classifiers for unbalanced data, and includes modules for data set description, parameter setting, classification algorithm selection, and results. It makes it convenient for users to choose a more appropriate classification algorithm and improves the efficiency of parameter tuning when modeling unbalanced data.
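The density-based initialization in contribution (2) can be illustrated with a minimal sketch. It is not the thesis's implementation, since the abstract does not give the formulas. The following are assumptions made for illustration only: the density definition (inverse of the mean distance to the k nearest neighbors, searched over all samples), the use of density directly as the weight, the final division by the number of classes so that the weights form a distribution, and all function names.

```python
import math


def knn_local_density(points, k):
    """Assumed density: inverse of the mean distance to the k nearest neighbors."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    densities = []
    for i, p in enumerate(points):
        # Distances from sample i to every other sample, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dist = sum(dists[:k]) / k
        # Small constant guards against division by zero for duplicate points.
        densities.append(1.0 / (mean_dist + 1e-12))
    return densities


def per_class_normalize(values, labels):
    """Scale values so that they sum to 1 within each class."""
    totals = {}
    for v, y in zip(values, labels):
        totals[y] = totals.get(y, 0.0) + v
    return [v / totals[y] for v, y in zip(values, labels)]


def initial_weights(points, labels, k=2):
    """Per-class normalized densities, rescaled so all weights sum to 1."""
    density = knn_local_density(points, k)
    per_class = per_class_normalize(density, labels)
    n_classes = len(set(labels))  # assumption: equal total weight per class
    return [w / n_classes for w in per_class]


# Toy data: four majority samples (label 0) and two minority samples (label 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 1, 1]
w = initial_weights(X, y, k=2)
```

Because each class is normalized separately, the two minority samples together carry the same total weight as the four majority samples, so each minority sample starts with a larger weight than a majority sample. The same per-class normalization step matches what contribution (3) describes for misclassification costs, with anomaly scores in place of densities.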
Keywords/Search Tags: Unbalanced Data, Classification, Ensemble Learning, AdaBoost, Sample Density, Isolation Forest