Research On Classification Technology For Imbalanced Data Sets

Posted on: 2021-03-25 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Z Wang | Full Text: PDF
GTID: 2428330647967274 | Subject: Intelligent perception and control
Abstract/Summary:
This thesis studies why classification models struggle to predict sample categories efficiently and accurately when the class distribution is imbalanced. First, classic classification algorithms for imbalanced data sets are analyzed and summarized, and the background knowledge and model evaluation metrics used in this thesis are described in detail. Then, from the perspective of noise samples, the k-nearest-neighbor idea is applied to the recognition of noise samples, and a KNN noise sample filtering algorithm is proposed. From the perspective of oversampling, the SMOTE algorithm is improved to address its shortcomings, and an imbalanced data set classification algorithm based on improved SMOTE is proposed. Next, to reduce running time and improve prediction accuracy, a clustering algorithm is combined with the SVM algorithm, and an imbalanced data classification algorithm based on clustering and SVM is proposed. Finally, building on this work, the proposed algorithms are applied to the practical problem of human pose classification: a human pose classification algorithm based on imbalanced data classification is proposed, and comparative experiments are performed to verify its performance.

The main work done in this study is as follows.

First, to improve the quality of synthesized samples, an imbalanced data set classification algorithm based on improved SMOTE is proposed, combining the ideas of k nearest neighbors and clustering. On the one hand, the algorithm proposes a noise sample recognition model based on the k-nearest-neighbor idea; on the other hand, it balances the sample information and safeguards the quality of the synthesized samples during oversampling. The algorithm also introduces the idea of clustering to correct synthesized samples promptly. Finally, the AdaBoost algorithm is used to train the model on the balanced sample set. Compared with several classic imbalanced classification algorithms, the experimental results show that the algorithm classifies better and generalizes more strongly.

Second, to improve classification accuracy and reduce running time, an imbalanced data classification algorithm based on the combination of clustering and SVM is proposed. The central idea is to under-sample the majority-class samples based on the distribution characteristics of the minority-class samples. Clusters are divided according to the distribution characteristics of the minority-class samples, and a definition of cluster boundaries is proposed that takes the interference of noise samples into account. Then, when constructing a balanced sample set for each cluster, the algorithm proposes three principles for sampling the majority-class samples based on the characteristics of the samples the cluster contains. Finally, an SVM with a mixed kernel function is trained on each balanced cluster sample set, and the final classification model is obtained as a linear combination of these models. Experimental verification shows that the algorithm not only improves prediction accuracy over the whole sample set but also shortens the overall running time.

Third, building on this work, the proposed algorithm is applied to human pose classification, and a human pose classification algorithm based on imbalanced data classification is proposed. A comparison with four classification algorithms on the AReM human pose data set shows that the proposed algorithm effectively addresses the problem of low prediction accuracy under a real human pose distribution.
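
The two preprocessing ideas named above, k-nearest-neighbor noise filtering followed by SMOTE-style oversampling, can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the function names, the toy data, and the noise rule (drop a minority sample when all of its k nearest neighbors belong to the majority class) are assumptions made for illustration, and the clustering-based correction of synthesized samples and the AdaBoost training step are not shown.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return indices of the k points closest to points[idx], excluding itself."""
    dists = [(math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx]
    dists.sort()
    return [j for _, j in dists[:k]]


def filter_minority_noise(points, labels, minority=1, k=3):
    """Drop minority samples whose k nearest neighbours are all majority samples.

    Assumed rule for illustration; the abstract does not state the exact criterion.
    """
    keep = []
    for i, label in enumerate(labels):
        if label == minority:
            neighbours = k_nearest(i, points, k)
            if all(labels[j] != minority for j in neighbours):
                continue  # treated as noise and skipped
        keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


def smote_like(minority_points, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (the basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority_points))
        j = rng.choice(k_nearest(i, minority_points, k))
        gap = rng.random()  # position along the segment between the two samples
        a, b = minority_points[i], minority_points[j]
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy data (hypothetical): four majority points, a minority cluster of three,
# and one minority point sitting inside the majority cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (0.05, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

clean_points, clean_labels = filter_minority_noise(points, labels, minority=1, k=3)
minority = [p for p, y in zip(clean_points, clean_labels) if y == 1]
new = smote_like(minority, n_new=1, k=2)  # one new sample balances 4 vs 4
```

Filtering before oversampling matters because SMOTE interpolates between existing minority samples: a noisy minority point left inside the majority region would seed further synthetic points there.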
Keywords/Search Tags:K nearest neighbor algorithm, SMOTE algorithm, clustering algorithm, AdaBoost algorithm, mixed kernel function