Improvement Of Preprocessing Technology And Algorithm On Multi-class Imbalanced Data Set

Posted on: 2022-12-10
Degree: Master
Type: Thesis
Country: China
Candidate: P Y Xie
Full Text: PDF
GTID: 2518306764468454
Subject: Computer Software and Application of Computer
Abstract/Summary:
Modern society is an information society: large amounts of complex data are produced at every moment, and how to extract valuable information from that data is a long-standing question. Real-world classification data sets are often class-imbalanced, i.e. one class accounts for a much larger share of the data than the others. Multi-class imbalance is the most common form of this problem, yet traditional classification algorithms are designed for balanced data sets, so classifying multi-class imbalanced data is a research topic of considerable practical value. Solutions to the multi-class imbalance problem fall into two groups: data preprocessing and algorithm-level methods. Resampling is used at the data preprocessing stage, and improving ensemble learning methods is one of the important algorithm-level approaches. The research in this thesis focuses on these two strategies and mainly includes the following.

(1) Mahalanobis Distance-based Over-sampling (MDO) is a novel oversampling method in which newly synthesized minority class samples have a Mahalanobis distance similar to that of the original samples, keeping the overall covariance of the sample set unchanged. However, the synthesis spaces of the minority samples selected by this method tend to overlap with each other, so the synthesized minority class samples still overlap excessively. To reduce this overlap, improve sensitivity on the balanced samples, and make the balanced data more consistent with the original sample set, this thesis proposes an improved oversampling method, Improved Mahalanobis Distance-based Over-sampling (IMDO). IMDO uses the Fisher criterion to project and synthesize sample points, choosing the projection direction with the smallest within-group variance and the largest between-group variance, which effectively reduces the risk that synthesized minority samples intrude into the space of other classes.

(2) In ensemble learning, random forest is an excellent method that combines Bagging with majority voting to determine the final classification result. This effectively reduces the variance of the classification result, but the classification error is not effectively reduced. To address the limited error reduction of the random forest algorithm, a second round of weighting is applied to the random forest to obtain a weighted random forest (WRF). Based on the correct classification rate of the leaf nodes of each tree, a linear weighting method reassigns a weight to each tree, so that trees that classify well keep a large weight in the forest and trees with a high classification error rate receive a smaller one. This improves the classification result in the final vote.

(3) Combining IMDO and WRF yields the final hybrid integrated processing technique, IMDO-WRF, which can be applied effectively to multi-class imbalanced data sets. Six indicators are selected (ACC, MAUC, F-measure, G-mean, Precision and Recall), and experiments on 10 multi-class imbalanced data sets are designed to measure the performance of the algorithm. The experimental results verify that IMDO-WRF has significant advantages and stability compared with other imbalanced data processing techniques.
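
The weighted voting idea behind WRF can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the weighting formula, so the code assumes each tree is scored by its accuracy on held-out samples (standing in for the leaf-node correct classification rate) and that accuracies are rescaled linearly onto a hypothetical range [w_min, 1]. The function names `linear_tree_weights` and `weighted_vote` and the parameter `w_min` are illustrative only.

```python
from collections import defaultdict


def linear_tree_weights(accuracies, w_min=0.1):
    """Rescale per-tree accuracies linearly onto [w_min, 1.0].

    Assumption: the best tree gets weight 1.0 and the worst gets w_min.
    The thesis only states that a linear weighting is used.
    """
    lo, hi = min(accuracies), max(accuracies)
    if hi == lo:
        # All trees perform equally well: fall back to equal weights.
        return [1.0] * len(accuracies)
    return [w_min + (1.0 - w_min) * (a - lo) / (hi - lo) for a in accuracies]


def weighted_vote(tree_predictions, weights):
    """Combine per-tree labels by weighted voting.

    tree_predictions[t][i] is the label tree t assigns to sample i.
    """
    n_samples = len(tree_predictions[0])
    result = []
    for i in range(n_samples):
        scores = defaultdict(float)
        for preds, w in zip(tree_predictions, weights):
            scores[preds[i]] += w  # each tree adds its weight to its class
        result.append(max(scores, key=scores.get))
    return result


def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


if __name__ == "__main__":
    # Toy held-out labels and the predictions of three hypothetical trees.
    y_val = [0, 1, 2, 2]
    val_preds = [
        [0, 1, 2, 2],  # tree 0: all four correct
        [1, 0, 2, 2],  # tree 1: two correct
        [1, 0, 0, 2],  # tree 2: one correct
    ]
    weights = linear_tree_weights([accuracy(p, y_val) for p in val_preds])

    # One new sample: tree 0 votes class 0, trees 1 and 2 vote class 1.
    new_preds = [[0], [1], [1]]
    print(weighted_vote(new_preds, weights))          # weighted vote
    print(weighted_vote(new_preds, [1.0, 1.0, 1.0]))  # plain majority vote
```

In this toy case the plain majority vote follows the two weaker trees and picks class 1, while the weighted vote follows the single most accurate tree and picks class 0, which is the effect the reweighting is meant to have on the final vote.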
Keywords/Search Tags: minority class samples, IMDO, weighted random forests, multi-class imbalance, hybrid integrated processing technique