Research On The Classification Of Imbalanced Data Sets And Related Problems

Posted on:2021-04-12

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhang

Full Text:PDF

GTID:2428330620964208

Subject:Engineering

Abstract/Summary:

Imbalanced data set classification is an active topic in the field of data classification. A common approach is to preprocess the imbalanced data set with over-sampling so that the classifier identifies more minority-class samples. Existing over-sampling methods, however, have several problems: the density of the data set is inconsistent before and after balancing, the region in which new samples are generated is too small, and samples tend to overlap when the sampling rate is very high. This thesis proposes novel over-sampling algorithms that try to avoid these problems when classifying skewed data sets. For imbalanced numerical data sets, it proposes two over-sampling methods. For imbalanced text data sets, where linear interpolation has poor explanatory power, it proposes a new text representation method. The main work of this thesis is as follows:

1. Existing over-sampling methods suffer from inconsistent data density before and after over-sampling, unreasonable allocation of sampling weights to minority samples, and flawed evaluation of data set sparseness. This thesis proposes measuring the sparseness of a data set by the minimum distance between samples, and over-sampling a sample earlier the farther away its neighbor is. The resulting over-sampling method is based on the minimum and maximum distances between minority samples. After training on the preprocessed data sets, the classifier recognizes more minority samples, and its ability to recognize majority samples correctly is not weakened.

2. SMOTE selects a single auxiliary sample, which keeps the synthesis space of the new sample small, so the probability of sample overlap is very high when the imbalance rate is high. This thesis proposes selecting two auxiliary samples and a root sample so that they form a triangle, which expands the space in which new samples are synthesized and reduces the probability of overlap. In addition, the auxiliary samples are selected from the boundary area, which makes the boundary between classes clearer. Experiments on 14 imbalanced data sets showed that G-mean improved on about 85.7% of the data sets and F1 mean improved on 78.6% of them.

3. Because linear interpolation of text has poor explanatory power and text representation is an important part of text classification, this thesis improves the text representation method so that traditional classification algorithms achieve better results on imbalanced text data sets. Existing text representation methods do not fully consider the ability of feature items to distinguish between categories. To address this, the thesis proposes a new concept, class discrimination ability, and applies it to the text representation of imbalanced data sets. Taking the TF-IDF algorithm as the carrier, TF-IDF-? is proposed to assign a weight to each feature item. Both F1 mean and recall improve, which indicates that TF-IDF-? can improve the classification of imbalanced text data sets, with an F1 mean improvement of up to 4.07%.
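
The triangle-based synthesis described in item 2 can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function name `triangle_oversample`, the parameter `k`, and the choice of the two auxiliary samples at random from the root's `k` nearest minority neighbors are all assumptions made for illustration. The abstract says the auxiliary samples come from the boundary area but does not specify how. The sketch only shows the geometric idea of drawing a new point uniformly inside the triangle formed by a root sample and two auxiliary samples, rather than on the line segment SMOTE uses.

```python
import numpy as np

def triangle_oversample(minority, n_new, k=5, seed=0):
    """Synthesize n_new points, each inside a triangle of minority samples.

    Hypothetical helper: auxiliary samples are picked from the root's
    k nearest minority neighbors (an assumption, not the thesis's rule).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 3:
        raise ValueError("need at least three minority samples")
    k = max(2, min(k, n - 1))  # at least two neighbors are required

    # Pairwise Euclidean distances; a sample is not its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    new = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        root = rng.integers(n)
        a, b = rng.choice(neighbors[root], size=2, replace=False)
        # Uniform point in a triangle: reflect (u, v) back when u + v > 1.
        u, v = rng.random(2)
        if u + v > 1.0:
            u, v = 1.0 - u, 1.0 - v
        new[i] = X[root] + u * (X[a] - X[root]) + v * (X[b] - X[root])
    return new
```

Each synthetic point is a convex combination of three minority samples, so it stays inside their triangle. Calling `triangle_oversample(X_min, n_new=100)` on an array of minority samples `X_min` would return 100 such points to append to the training set.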

Keywords/Search Tags:

imbalanced data sets classification, over-sampling, SMOTE, TF-IDF

Related items

1	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
2	Research On The Classification Of Imbalanced Data Sets And Related Problems
3	The Study On Random-SMOTE For The Classification Of Imbalanced Data Sets
4	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
5	Research On Classification Method Of Imbalanced Data Set Based On Improved Sampling Strategy
6	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
7	Text Classification Algorithm Based On Imbalanced Data Sets
8	Complaints Text Classification Research Of Imbalanced Data Sets
9	Research On Classification Method Of Imbalanced Data Sets
10	Research On The Classification Algorithm Of Imbalanced Data Sets