Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm

Posted on:2021-12-25

Degree:Master

Type:Thesis

Country:China

Candidate:J X Xuan

Full Text:PDF

GTID:2518306353978949

Subject:Mathematics

Abstract/Summary:

With the rapid development of artificial intelligence, the classification of medical big data has become an important aid to medical diagnosis. However, because samples of diseased cases are difficult to collect, medical data are often imbalanced, and the classification of imbalanced data sets arises more and more often in the medical field. An imbalanced data set is one in which the classes contain unequal numbers of samples. Most classification problems studied in the medical field, and most earlier classification algorithms, assume balanced data sets, so when they are applied to imbalanced data the samples of the minority class tend to be misclassified. This thesis studies the expansion and classification of several types of imbalanced data sets based on the C-SMOTE (Center-Synthetic Minority Over-sampling Technique) algorithm.

(1) An improved C-SMOTE algorithm based on the normal distribution is proposed. During data expansion, a normal random distribution replaces the uniform random distribution, so that new sample points fall near the center of the minority samples with higher probability and the distribution of the original minority data is simulated more closely. This effectively avoids the marginalization phenomenon, in which expanded samples deviate from the sample center.

(2) The parameters of the improved, normal-distribution-based C-SMOTE algorithm are estimated and analyzed. The most critical issue in the improved algorithm is how to choose the standard deviation σ of the normal probability density function. This thesis takes σ = s and σ = s/3 in turn, where s is the standard deviation of the normalized minority-class data, and controls the distribution of the generated data according to the μ-3σ property of the normal distribution. A random forest model is then used to classify the expanded data, and the classification results are compared and analyzed using the OOB, AUC, F-value, and G-value metrics.

(3) Using the between-class variance and the within-class standard deviation, the thesis analyzes how the statistical characteristics of the original minority-class data influence the choice of parameters in the proposed expansion algorithm. The between-class distance and the sample variance are computed for the original minority-class data and for the data expanded with σ = s and σ = s/3. The results show that the parameter giving the best classification results is the one for which the statistical characteristics of the expanded data are closest to those of the original data.

Classification results on 5 imbalanced data sets show that the improved C-SMOTE algorithm based on the normal distribution classifies better than the original C-SMOTE algorithm based on the uniform distribution.
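The idea in (1) and (2), drawing new minority samples from a normal distribution around the class center instead of uniformly, can be illustrated with a small sketch. This is only one possible reading of the abstract, not the thesis's actual procedure: the sampling rule (independent per-feature draws centered on the class mean), the function name `normal_center_oversample`, the `sigma_scale` parameter, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def normal_center_oversample(minority, n_new, sigma_scale=1.0, seed=None):
    """Draw n_new synthetic points around the minority-class center.

    Assumption (not taken from the thesis): feature j is sampled
    independently from N(center_j, (sigma_scale * s_j)^2), where s_j is
    the sample standard deviation of feature j in the minority class.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))  # transpose rows into per-feature columns
    center = [statistics.fmean(col) for col in columns]
    spread = [statistics.stdev(col) for col in columns]
    synthetic = [
        [rng.gauss(c, sigma_scale * s) for c, s in zip(center, spread)]
        for _ in range(n_new)
    ]
    return center, synthetic


# Hypothetical, already-normalized minority-class samples (two features).
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.15], [0.12, 0.22]]

# Wider setting: sigma equals the class standard deviation s.
center, wide = normal_center_oversample(minority, n_new=200, sigma_scale=1.0, seed=0)
# Narrower setting: sigma is one third of s, so by the 3-sigma rule
# almost all synthetic points lie within s of the center.
_, tight = normal_center_oversample(minority, n_new=200, sigma_scale=1 / 3, seed=0)
```

A smaller `sigma_scale` concentrates the synthetic points near the center, while a larger one spreads them toward the edge of the class. The abstract describes the choice of the standard deviation as the key parameter decision, and this is the trade-off behind it.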

Keywords/Search Tags:

Unbalanced Data, C-SMOTE Algorithm, Normal Distribution, Parameter Selection, Random Forest Algorithm

Related items

1	Research On Optimization And Improvement Of Random Forests Algorithm And Its Parallelization
2	Optimization Of Distributed Random Forest Algorithm Based On Hierarchical Subspace
3	Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm
4	Research On Adaptive Feature Selection And Parameter Optimization Algorithm For Random Forest
5	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
6	Research On Intrusion Detection Technology Based On Random Forest Algorithm
7	Analysis Of Unbalanced Grain Loss Data Based On RockSmote-Rf
8	Improvement And Application Of SMOTE Algorithm
9	Research And Application Of Classification Algorithm Based On Unbalanced Data
10	Research On Random Forest Similarity Algorithm