
Research And Application Of Classification Algorithm Based On Unbalanced Data

Posted on: 2022-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Guan
Full Text: PDF
GTID: 2518306557964149
Subject: Logistics Engineering
Abstract/Summary:
With the rapid development of information and Internet technology, data of all kinds is growing quickly and unbalanced data sets are widespread, which has made unbalanced data classification a research hotspot in data mining. The classification methods proposed so far fall mainly into three groups: data sampling, ensemble learning, and cost-sensitive learning. Oversampling may cause the classification model to overfit, lengthen training time, and introduce noise among similar samples. Undersampling may delete samples that carry important information, leaving the model insufficiently trained and underfitted. This thesis proposes, for the first time, a centroid space up-sampling algorithm (CSUP) for imbalanced data classification. Then, to address long clustering times, a large number of invalid iterations during clustering, and unstable clustering results, it proposes a CSUP algorithm based on an improved k-means clustering algorithm.

First, CSUP addresses the imbalance of the data set during classification. The k-means algorithm is used to cluster the minority-class samples, and an initial centroid is obtained for each minority cluster based on Euclidean distance. The Euclidean distances of the centroids are summed to give a total Euclidean distance, and the distance of each single centroid divided by this total gives a weight. The weight is then multiplied by the total number of sample points needed for balance, and the imbalanced data set is balanced accordingly, which improves the classification efficiency of the model. The experimental results show that the classification accuracy of this algorithm is significantly higher than that of random sampling, SMOTE (Synthetic Minority Oversampling Technique), the AdaBoost ensemble learning algorithm, the ICIKMDS algorithm, and the Rotation SMOTE algorithm.

However, CSUP uses the traditional k-means algorithm to cluster the minority samples. Because the initial cluster centers are selected by the machine, clustering needs a large number of iterations, which increases clustering time, makes the clustering results unstable, and tends to produce noise points. To address these shortcomings, the thesis proposes a CSUP algorithm based on an improved k-means clustering algorithm. Following the principle that the larger the distance, the clearer the separation, the improved method selects the initial cluster centers and then iteratively calculates the Euclidean distance from each sample point to each center. A storage data structure saves, in each iteration, the cluster center of each sample point and the distance from the point to that center. In the next iteration, the distances from a sample point to the other cluster centers are not calculated first; instead, the stored information is compared first. The experimental results show that the improved method avoids repeatedly calculating the distance from each sample to all other cluster centers, reduces the number and time of distance calculations, speeds up clustering, improves accuracy, lowers the computational complexity and running time of the algorithm, and avoids local optima during clustering.
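The two steps described above, distance-proportional weighting and distance-based selection of initial centers, can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names `allocate_samples` and `farthest_first_centers` are hypothetical. The abstract does not say how each centroid's Euclidean distance is measured, so the distances are taken as given inputs. Reading "weight times total sample points" as a per-cluster sample count is an assumption, as is the largest-remainder rounding. The abstract also does not spell out the rule for choosing initial centers, so a common farthest-first (maximin) heuristic is swapped in here.

```python
import math


def allocate_samples(centroid_distances, n_needed):
    """Split n_needed new samples across clusters in proportion to distance.

    centroid_distances: one non-negative Euclidean distance per minority
    cluster centroid (how it is measured is left to the caller).
    """
    total = sum(centroid_distances)
    if total <= 0:
        raise ValueError("distances must sum to a positive value")
    # Weight of each cluster = its distance / total distance.
    weights = [d / total for d in centroid_distances]
    raw = [w * n_needed for w in weights]
    counts = [math.floor(r) for r in raw]
    # Assumption: give leftover samples to the largest fractional parts
    # so that the counts add up to exactly n_needed.
    leftover = n_needed - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return weights, counts


def farthest_first_centers(points, k):
    """Pick k initial centers, each as far as possible from those already chosen."""
    centers = [points[0]]  # assumption: start from the first point
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Update only against the newly added center.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return centers


if __name__ == "__main__":
    weights, counts = allocate_samples([1.0, 3.0], 8)
    print(weights, counts)
    print(farthest_first_centers([(0, 0), (1, 0), (10, 0), (0, 9)], 3))
```

In this sketch a cluster with a larger distance receives a larger share of the new samples, and the per-cluster counts always sum to the number requested.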
Keywords/Search Tags: up-sampling, SMOTE algorithm, ensemble learning, unbalanced data, centroid, k-means algorithm