Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm

Posted on: 2021-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: J X Xuan
GTID: 2518306353978949
Subject: Mathematics
Abstract/Summary:
With the rapid development of artificial intelligence, classifying medical big data has become an important aid to medical diagnosis. However, because disease samples are difficult to collect, medical data are often imbalanced, and the classification of imbalanced data sets arises more and more often in the medical field. An imbalanced data set is one in which the number of samples differs between classes. Most classification algorithms, including those commonly studied in the medical field, were developed for balanced data sets, so on imbalanced data they tend to misclassify samples of the minority class. This thesis studies the expansion and classification of several imbalanced data sets based on the C-SMOTE (Center-Synthetic Minority Over-sampling Technique) algorithm.

(1) An improved C-SMOTE algorithm based on the normal distribution is proposed. During data expansion, normally distributed random numbers replace uniformly distributed ones, so that new sample points fall near the center of the minority-class samples with higher probability. This better reproduces the distribution of the original minority-class data and effectively avoids the marginalization phenomenon, in which expanded samples deviate from the sample center.

(2) The parameters of the improved normal-distribution-based C-SMOTE algorithm are estimated and analyzed. The key issue in the improved algorithm is how to choose the standard deviation σ of the normal probability density function. This thesis takes σ = s and σ = s/3 respectively, where s is the standard deviation of the normalized minority-class data, and controls the distribution of the generated data using the μ ± 3σ property of the normal distribution. A random forest model is then used to classify the expanded data, and the classification performance is compared and analyzed using the OOB, AUC, F-value, and G-value indices.

(3) Based on the between-class variance and the within-class standard deviation, the influence of the statistical characteristics of the original minority-class data on the choice of parameters for the proposed expansion algorithm is analyzed. The between-class distance and the sample variance are computed for the original minority-class data and for the data expanded with σ = s and σ = s/3. The results show that the parameter value whose expanded data are statistically closest to the original data is also the one that gives the best classification results.

Classification results on 5 real imbalanced data sets show that the improved normal-distribution-based C-SMOTE algorithm classifies better than the original uniform-distribution C-SMOTE algorithm.
Keywords/Search Tags:Unbalanced Data, C-SMOTE Algorithm, Normal Distribution, Parameter Selection, Random Forest Algorithm
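
The expansion step in item (1) of the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation formula, so everything below is an assumption made for illustration: each synthetic point is placed on the segment between the minority-class centroid and a randomly chosen minority sample, the normal variant draws the interpolation weight as |N(0, σ²)| clipped to [0, 1], and s is taken as a pooled standard deviation over features. The function names (`center_oversample`, `pooled_std`, and so on) are hypothetical and not taken from the thesis.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def pooled_std(points):
    """Assumed definition of s: square root of the mean per-feature
    population variance of the (already normalized) minority data."""
    c = centroid(points)
    n = len(points)
    variances = [
        sum((p[j] - c[j]) ** 2 for p in points) / n for j in range(len(c))
    ]
    return math.sqrt(sum(variances) / len(variances))


def center_oversample(minority, n_new, sigma=None, seed=0):
    """Hypothetical centre-based over-sampling sketch.

    Each synthetic point is new = c + w * (x - c), where c is the minority
    centroid and x is a randomly chosen minority sample.
      sigma=None  -> w ~ U(0, 1)                 (uniform baseline)
      sigma=float -> w = min(|N(0, sigma^2)|, 1) (assumed normal variant)
    """
    rng = random.Random(seed)
    c = centroid(minority)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        if sigma is None:
            w = rng.random()
        else:
            # Small weights are the most likely, so points cluster near c.
            w = min(abs(rng.gauss(0.0, sigma)), 1.0)
        synthetic.append([ci + w * (xi - ci) for ci, xi in zip(c, x)])
    return synthetic


def mean_dist_to_centroid(points, c):
    """Average Euclidean distance from each point to the centroid c."""
    return sum(math.dist(p, c) for p in points) / len(points)
```

Under these assumptions, passing `sigma=None` gives the uniform baseline, while passing a small value such as `pooled_std(minority) / 3` concentrates the synthetic points around the centroid. Comparing `mean_dist_to_centroid` for the two outputs is one simple way to see the difference in spread.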