
Research On The Classification Of Imbalanced Data Sets And Related Problems

Posted on: 2021-04-12  Degree: Master  Type: Thesis
Country: China  Candidate: T Zhang  Full Text: PDF
GTID: 2428330620964208  Subject: Engineering
Abstract/Summary:
Classifying imbalanced data sets is an active topic in data classification. A common remedy is to preprocess the imbalanced data with over-sampling so that the classifier identifies more minority-class samples. Existing over-sampling methods, however, share several problems: the density of the data set is not consistent before and after balancing, the region in which new samples are generated is too small, and samples are likely to overlap when the sampling rate is very high. This thesis proposes novel over-sampling algorithms that try to avoid these problems as far as possible. On the one hand, it proposes two over-sampling methods for classifying imbalanced numerical data sets. On the other hand, because linear interpolation has poor explanatory power for text, it proposes a new text representation method for classifying imbalanced text data sets. The main work of this thesis is as follows:

1. Existing over-sampling methods suffer from inconsistent data density before and after over-sampling, unreasonable allocation of sampling weights among minority samples, and flawed evaluation of data set sparseness. This thesis proposes to measure the sparseness of the data set by the minimum distance between samples, and to over-sample a sample earlier the farther away its neighbor is. The result is an over-sampling method based on the minimum and maximum distances between minority samples. With this sampling strategy, a classifier trained on the preprocessed data sets recognizes more minority samples, while its ability to recognize majority samples correctly is not weakened.

2. SMOTE selects a single auxiliary sample, which keeps the synthesis space for a new sample small, so the probability of samples overlapping is very high when the imbalance rate is high. This thesis proposes to select two auxiliary samples and a root sample so that the three form a triangle, which expands the space in which new samples are synthesized and reduces the probability of overlap. In addition, the auxiliary samples are selected from the boundary area, which makes the boundary between the classes clearer. Experiments on 14 imbalanced data sets showed that the G-mean improved on about 85.7% of the data sets and the F1 mean improved on 78.6% of them.

3. Because linear interpolation of text has poor explanatory power and text representation is an important part of text classification, this thesis improves the text representation method so that traditional classification algorithms achieve better results on imbalanced text data sets. Existing text representation methods do not fully consider the ability of feature items to distinguish between categories. To address this defect, the thesis proposes a new concept, class discrimination ability, and applies it to the text representation of imbalanced data sets. Taking the TF-IDF algorithm as the carrier, TF-IDF-? is proposed to assign a weight to each feature item. Both the F1 mean and the recall rate improve, which indicates that TF-IDF-? can improve the classification of imbalanced text data sets; the F1 mean improves by up to 4.07%.
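
The triangle idea in item 2 can be illustrated with a minimal sketch. The abstract does not say how a point is drawn inside the triangle or how the auxiliary samples are picked from the boundary area, so the code below assumes uniform sampling within the triangle and takes the three minority samples as given. The function name `triangle_oversample` and its parameters are hypothetical and are not taken from the thesis.

```python
import numpy as np


def triangle_oversample(root, aux_a, aux_b, n_new, rng=None):
    """Draw n_new synthetic points inside the triangle (root, aux_a, aux_b).

    Assumption: points are sampled uniformly in the triangle. The thesis
    abstract does not specify the sampling distribution.
    """
    rng = np.random.default_rng(rng)
    root = np.asarray(root, dtype=float)
    aux_a = np.asarray(aux_a, dtype=float)
    aux_b = np.asarray(aux_b, dtype=float)

    # Two independent weights in [0, 1) per synthetic sample.
    u = rng.random(n_new)
    v = rng.random(n_new)

    # (u, v) with u + v > 1 lies in the far half of the parallelogram;
    # reflecting it maps it back into the triangle.
    outside = u + v > 1.0
    u[outside] = 1.0 - u[outside]
    v[outside] = 1.0 - v[outside]

    # Convex combination of the three vertices.
    return root + u[:, None] * (aux_a - root) + v[:, None] * (aux_b - root)


if __name__ == "__main__":
    new_points = triangle_oversample([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 5, rng=0)
    print(new_points)
```

By comparison, SMOTE-style interpolation with one auxiliary sample places each new point on the line segment between two samples, so the triangle gives a two-dimensional region to draw from rather than a one-dimensional one.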
Keywords/Search Tags: imbalanced data sets classification, over-sampling, SMOTE, TF-IDF