Research On Under-sampling Classification Method Of Unbalanced Data

Posted on:2022-10-21

Degree:Master

Type:Thesis

Country:China

Candidate:P Yuan

Full Text:PDF

GTID:2518306761996519

Subject:Computer Software and Application of Computer

Abstract/Summary:

Classification is a common task in data mining. Classical classification algorithms are usually designed to maximize overall accuracy and work best when the class distribution is roughly balanced. As data volumes grow rapidly across applications, the growth is often uneven: some classes accumulate samples very quickly while others grow only slowly, which leads to class imbalance. Traditional classification algorithms then tend to favor the majority class, so imbalanced data requires further study. In a classification task, not all samples contribute to the classifier, and the effect of low-contribution samples is more pronounced when the class distribution is imbalanced. This thesis studies low-contribution samples and class imbalance at the data level, and also studies a downstream classification algorithm for imbalanced data. The main work is as follows:

1) Based on a theoretical analysis of samples that contribute little to the classifier, the OBUS algorithm is proposed. The algorithm first uses kNN to filter out minority-class samples that have a negative impact on classification. It then compresses the majority-class sample space in the direction perpendicular to the separating hyperplane and selects majority-class samples in the direction parallel to the hyperplane. This retains the samples that contribute most to classification and yields balanced data for learning.

2) Building on OBUS, the C-OBUS algorithm is proposed. It adds k-means cluster analysis and then performs the compression sampling within the clustering results, so that the data distribution is preserved before and after sampling.

3) For the classification problem, the Naive Bayes method is studied. The attribute-independence assumption of Naive Bayes limits its classification performance. Because a Laplacian matrix can characterize the relationships between feature attributes well, the LPNB algorithm is proposed. It uses a Laplacian matrix to represent the weights of the relationships between data attributes, uses PSO to search for the best representation of the Laplacian matrix, and then applies feature weighting to improve the classification performance of Naive Bayes.

4) OBUS, C-OBUS and LPNB are combined to study the imbalance problem as a whole, covering both the data level and the downstream classification algorithm. Experiments on data sets from UCI and KEEL compare LPNB with other classification algorithms and show that LPNB achieves better classification results. The experiments also show that OBUS and C-OBUS are effective on imbalanced data classification tasks. Finally, experiments that combine OBUS, C-OBUS and LPNB obtain good classification results on imbalanced data sets.
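The two data-level ideas mentioned in the abstract, kNN filtering of harmful minority samples and sampling within k-means clusters, can be illustrated with a minimal Python sketch. This is not the OBUS or C-OBUS algorithm: the abstract does not give the exact filtering rule or the hyperplane-based compression step, so the sketch assumes a simple rule (drop a minority point when all of its k nearest neighbours are majority points) and replaces compression sampling with plain random sampling per cluster. All function names and parameters are hypothetical.

```python
import math
import random


def knn_filter_minority(minority, majority, k=3):
    """Keep minority points that have at least one minority point among
    their k nearest neighbours (assumed noise rule, not from the thesis)."""
    kept = []
    for i, p in enumerate(minority):
        # Label 0 = minority neighbour, 1 = majority neighbour.
        neighbours = [(math.dist(p, q), 0) for j, q in enumerate(minority) if j != i]
        neighbours += [(math.dist(p, q), 1) for q in majority]
        nearest = sorted(neighbours)[:k]
        if any(label == 0 for _, label in nearest):
            kept.append(p)
    return kept


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def cluster_undersample(majority, n_target, k=2, seed=0):
    """Draw about n_target majority points, giving each cluster a quota
    proportional to its size so the cluster proportions are roughly kept.
    Rounding can make the total differ slightly from n_target."""
    labels = kmeans_labels(majority, k, seed=seed)
    rng = random.Random(seed)
    sampled = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        quota = round(n_target * len(members) / len(majority))
        sampled += rng.sample(members, min(quota, len(members)))
    return sampled


# Toy data: the minority point (5, 5) sits inside a group of majority points.
minority = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
majority = [(5.0, 4.0), (5.0, 6.0), (4.0, 5.0), (6.0, 5.0),
            (20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0)]

clean_minority = knn_filter_minority(minority, majority, k=3)
reduced_majority = cluster_undersample(majority, n_target=4, k=2)
```

Sampling per cluster rather than over the whole majority class is what keeps small but distinct majority subgroups represented after undersampling, which is the motivation the abstract gives for adding k-means.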

Keywords/Search Tags:

Imbalanced Data, Undersampling in Orthogonal Directions, Undersampling Based on Clustering, Bayesian Classification

Related items

1	An Undersampling Method Based On KAMILA Clustering And Elimination Of Redundancy
2	Research On Imbalanced Data Undersampling Classification Based On Constructive Covering
3	Research Of Imbalanced Datasets Preprocessing Combined With Clustering
4	Research On Classification Algorithm For Imbalanced Data
5	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
6	Hashing-based Undersampling Ensemble For Imbalanced Classification Problems And The Application In Activity Recognition
7	Research On Neighborhood-aware Imbalanced Data Sampling Classification
8	Classification Of Imbalanced Data Based On Margin Distribution Boosting Algorithm
9	Research On Methods For Classifying Imbalanced Data
10	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets