Research On Classification Algorithm Of Typical Imbalanced Data Sets

Posted on: 2022-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Jia
Full Text: PDF
GTID: 2480306536479224
Subject: Biomedical engineering
Abstract/Summary:
In the biomedical field, imbalanced data sets are common. The minority-class samples are usually the most valuable, and misclassifying them often carries a higher cost, so improving the classification of minority-class samples in imbalanced data sets is very important. There are currently two ways to address imbalanced classification: balancing the data set with a sampling algorithm at the data level, and improving the classification algorithm itself. This thesis proposes a model that combines oversampling with a stacking classification algorithm to classify the quality of wine, a biological fermentation product, and then transfers the method to the classification of a diabetes data set from the field of medical diagnosis, which verifies that the method generalizes well. The main contents of this thesis are as follows:
(1) All data sets come from the UCI Machine Learning Repository: 1599 red wine samples, 4898 white wine samples and 768 diabetes cases. The data sets are characterized by a wide distribution range, incomplete data and high imbalance. Preprocessing consists of data standardization, missing value filling, grouping of wine quality into grades, and resampling.
(2) Support vector machine, decision tree, random forest and extremely randomized trees classifiers are trained on the original wine data set and on the oversampled wine data set. Grid search is used to optimize the model parameters, and the performance of the different classifiers on the two data sets is compared.
(3) Two balanced versions of the wine data are generated: one by ADASYN oversampling alone, and one by random undersampling combined with ADASYN (mixed sampling). Support vector machine, decision tree, random forest and extremely randomized trees are then used as primary learners and XGBoost as the secondary learner to build a stacking model that classifies the quality of the two wine data sets.
(4) Red wine and white wine are merged into one data set, and the stacking model is used to distinguish red wine from white wine.
(5) The ADASYN algorithm is used to oversample the diabetes data set, which is then classified with the stacking model.
The results show that both the overall performance of the classifiers and their recognition of minority-class samples improve significantly on the oversampled data sets: the radial basis function SVM achieves the highest accuracy for red wine, 95%, and extremely randomized trees achieve the highest accuracy for white wine, 93%. On the oversampled red and white wine data sets, the stacking model reaches accuracies of 97.3% and 96.1% respectively, improvements of 2.3 and 3.1 percentage points over the best single classifiers above, with a prediction time of only about 0.22 s per sample. On the mixed-sampling red and white wine data sets, the stacking model reaches accuracies of 96.1% and 94.6% respectively, and its accuracy in distinguishing red wine from white wine reaches 99.6%. On the diabetes data set, the classification accuracy reaches 86.01%, and the minority class is classified better. The experiments show that the model classifies well not only wine quality and wine type but also the diabetes data set, which indicates that the method proposed in this thesis generalizes well.
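The ADASYN oversampling step described in (3) and (5) can be illustrated with a short sketch. This is a minimal, standard-library-only illustration of the general ADASYN idea, not the implementation used in the thesis. The function name `adasyn`, the parameters `k`, `beta` and `seed`, the one-vs-rest treatment of a single minority label, and the choice to interpolate toward minority-only neighbours are all assumptions made for the example.

```python
import math
import random


def adasyn(X, y, minority_label, k=5, beta=1.0, seed=0):
    """Return synthetic minority samples generated in the ADASYN style.

    Hypothetical helper for illustration: every sample whose label differs
    from `minority_label` is treated as "majority" (one-vs-rest).
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_min = len(minority)
    n_maj = len(y) - n_min
    # Total number of synthetic samples; beta = 1.0 aims at full balance.
    total = int(round((n_maj - n_min) * beta))
    if total <= 0 or n_min < 2:
        return []

    def nearest(i, candidates):
        # Indices of the k candidates closest to sample i, excluding i itself.
        others = [j for j in candidates if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:k]

    # r_i: share of majority points among the k nearest neighbours of each
    # minority sample, searched over the whole data set. A higher r_i means
    # the sample is harder to learn and receives more synthetic points.
    ratios = []
    for i in minority:
        neigh = nearest(i, range(len(X)))
        ratios.append(sum(y[j] != minority_label for j in neigh) / len(neigh))
    norm = sum(ratios)
    if norm == 0:
        # No minority sample has majority neighbours: use uniform weights.
        weights = [1.0 / n_min] * n_min
    else:
        weights = [r / norm for r in ratios]

    synthetic = []
    for i, w in zip(minority, weights):
        # Assumption: partners are the nearest minority samples, so every
        # synthetic point lies on a segment between two minority points.
        partners = nearest(i, minority)
        for _ in range(int(round(w * total))):
            j = rng.choice(partners)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic
```

Because the per-sample counts are rounded, the number of generated points can differ slightly from the target. For the multi-class wine quality grades, one plausible approach (again an assumption) is to call the function once per under-represented grade and append the returned points, with that grade's label, to the training set before fitting the stacking model.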
Keywords/Search Tags: Data Mining, Imbalanced Data Sets, Stacking Model, Wine, Diabetes data set