
Research On Classification Algorithm For Unbalanced Data

Posted on: 2024-05-20 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Li | Full Text: PDF
GTID: 2568307160475574 | Subject: Mathematics
Abstract/Summary:
With the rapid development of the Internet and information technology, many industries now produce large volumes of data in complex and diverse forms every day, and such data are frequently imbalanced and high-dimensional. Traditional classification models implicitly assume balanced samples; when the data are imbalanced, minimizing the expected risk leads the model to focus on majority-class samples while neglecting minority-class samples, whose misclassification is usually more costly. At the same time, high-dimensional data typically contain irrelevant and redundant features, which increase algorithmic complexity and weaken classification performance. To address these problems, this thesis works at three levels — data, algorithm, and feature selection — combining bidirectional sampling, cost-sensitive learning, and feature selection to improve classification on imbalanced and high-dimensional imbalanced data.

First, at the data level, the identification of boundary samples is analyzed. Because different data distributions call for different clustering methods, and the results of any single clustering method are unstable, a bidirectional sampling method is proposed. On the one hand, multiple clustering methods are fused to identify boundary samples, to which SMOTE_Tomeklinks hybrid sampling is applied so that the imbalanced data reach a relative balance. On the other hand, because sampling is restricted to boundary samples, the number of sampled points and the training time are reduced, while the minority class is learned more fully, limiting the impact of class imbalance on the model. Experiments comparing three methods on nine public datasets show that the proposed method improves both overall classification accuracy and minority-class recognition.

Second, at the algorithm level, the influence of imbalanced data on the loss function is studied. It is shown that when the split point moves within the overlapping interval, minority-class samples are mixed with majority-class samples, so class imbalance causes minority samples to be misclassified as majority samples. The first derivative of the loss for a misclassified minority sample varies only between 0 and 1, a range too narrow to make effective predictions on the minority class. Two penalty parameters, α and β, are therefore introduced into the loss function; theoretical derivation gives the parameter ranges α > 1 and 1 < β < 0.5(α + 1). The resulting loss function is used as the loss function of XGBoost and LightGBM, and these models serve as first-layer classifiers in a Stacking ensemble. Experiments comparing nine methods on several public datasets show improved overall accuracy, a higher minority-class recognition rate, and more stable classification.

Finally, at the feature-selection level for high-dimensional imbalanced data, the thesis addresses the problem that existing two-stage feature selection does not take minority-class samples into account. A three-stage feature selection is constructed: first, all minority-class samples are pre-screened using a variance-based feature-selection criterion; second, the selected features are located in the original dataset to form a new dataset for two-stage feature selection, ensuring that the finally selected features contribute positively to minority-class identification; third, Stacking ensemble learning is applied to the resulting training samples. Experiments show that classification performance is effectively improved.

Building on existing research and combining theories and methods from mathematics, statistics, and related disciplines, this thesis studies the classification of increasingly common imbalanced and high-dimensional imbalanced data at the data, algorithm, and feature-selection levels, which is a useful extension of existing work.
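The abstract does not reproduce the bidirectional sampling procedure in detail. As a minimal illustrative sketch of the oversampling half only, the following NumPy code performs SMOTE-style interpolation between minority samples and their nearest minority neighbours; the function name `smote_like_oversample` and its parameters are hypothetical, and the thesis' cluster-fusion boundary detection and Tomek-link cleaning are not reproduced here.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """SMOTE-style interpolation over minority samples (illustrative
    sketch only; the thesis applies SMOTE_Tomeklinks to boundary
    samples identified by fused clusterings, which is omitted here)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances, used to pick each sample's k nearest minority neighbours
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbours
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)              # pick a random minority sample
        b = nn[a, rng.integers(k)]       # pick one of its k neighbours
        lam = rng.random()               # interpolation factor in [0, 1)
        synth[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synth
```

Because each synthetic point is a convex combination of two minority samples, all generated points stay inside the bounding box of the minority class; in practice a library implementation such as imbalanced-learn's `SMOTETomek` would be preferred.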
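The exact penalized loss is not given in the abstract. As a hedged illustration of the cost-sensitive idea, the sketch below assumes a class-weighted logistic loss L = −α·y·log p − β·(1−y)·log(1−p) with p = sigmoid(z), and returns the per-sample gradient and Hessian in the form XGBoost and LightGBM expect from a custom objective; the function name and the default values of `alpha` and `beta` are assumptions, with the defaults chosen to respect the thesis' stated ranges α > 1 and 1 < β < 0.5(α + 1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logloss_grad_hess(z, y, alpha=2.0, beta=1.2):
    """Gradient and Hessian (w.r.t. the raw score z) of an assumed
    class-weighted logistic loss
        L = -alpha * y * log(p) - beta * (1 - y) * log(1 - p).
    With alpha > 1 the gradient magnitude for a misclassified minority
    sample exceeds the plain-logloss value of |p - y|, so minority
    errors are penalized more strongly."""
    p = sigmoid(z)
    grad = -alpha * y * (1 - p) + beta * (1 - y) * p
    hess = (alpha * y + beta * (1 - y)) * p * (1 - p)
    return grad, hess
```

Setting alpha = beta = 1 recovers the standard logistic gradient p − y, which is a quick sanity check that the weighting is the only change.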
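The first stage of the proposed three-stage feature selection — variance-based screening over minority-class samples only — can be sketched in a few lines. The helper name `minority_variance_screen` and the `top_k` cutoff are illustrative assumptions; the thesis' second and third stages (two-stage selection on the rebuilt dataset and Stacking training) are not reproduced.

```python
import numpy as np

def minority_variance_screen(X, y, minority_label=1, top_k=10):
    """Stage-one screening (illustrative sketch): rank features by their
    variance computed over minority-class samples only, and return the
    indices of the top_k highest-variance features."""
    var = X[y == minority_label].var(axis=0)   # variance within minority class
    return np.argsort(var)[::-1][:top_k]       # descending by variance
```

Screening on the minority class alone is what distinguishes this step from a plain `VarianceThreshold`, which would let majority-class variance dominate the ranking.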
Keywords/Search Tags: Cluster fusion, Bidirectional sampling, Loss function, Stacking ensemble learning, Three-stage feature selection