Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application

Posted on:2021-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:X Zhang

GTID:2428330626955300

Subject:Computer technology

Abstract/Summary:

Imbalanced data classification is an important research direction in machine learning and pattern recognition, with wide application value in fraud detection, medical diagnosis, and other fields. A data set is imbalanced when its class distribution is skewed: the majority classes contain far more samples than the minority classes. The minority-class samples are often the more valuable ones to study, so they deserve particular attention, but traditional classification methods do not handle them well. Classification of imbalanced data has therefore become a research hotspot. This thesis studies a density-based under-sampling method for imbalanced data sets, applies the resulting classification approach to fault detection, and implements a fault detection system based on IIS logs. The main work of this thesis is as follows:

1) US-DP is a density-based under-sampling method. It clusters the majority samples by density, sorts them by density peak value, selects the samples with higher density peak values, combines them with the minority samples to form a new sample set, and builds a classification model on the resulting data set. Because the method follows the dense and sparse regions of the data distribution and favors cluster centers in densely populated regions, it reduces the impact of noise points. Experimental verification shows that the method performs well on imbalanced data classification.

2) A fault detection system based on IIS logs is implemented with JSP + Servlet + JDBC. The system is divided into four functional modules: user login, data preprocessing, data analysis, and result visualization. The system first processes the log data to transform its attributes and formats; it then applies sampling methods (random under-sampling, K-means, Tomek links, US-DP) to the log data and classifies it with classification algorithms (C4.5, 3-NN, naive Bayes).
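
The density-based selection that the abstract describes for US-DP can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `density_undersample`, the Gaussian-kernel density estimate, the quantile-based cutoff distance `d_c`, and the choice to keep as many majority samples as there are minority samples are all assumptions made for illustration. The abstract states only that majority samples are ranked by density and the denser ones are kept alongside the minority samples.

```python
import numpy as np

def density_undersample(X, y, majority_label, cutoff_quantile=0.1):
    """Keep the densest majority samples plus every minority sample.

    Hypothetical helper: assumes at least two majority samples and a
    numeric feature matrix X of shape (n_samples, n_features).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)

    # Pairwise Euclidean distances between majority samples only.
    X_maj = X[maj_idx]
    diff = X_maj[:, None, :] - X_maj[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Cutoff distance d_c: a low quantile of the off-diagonal distances
    # (an assumed heuristic, not taken from the thesis).
    off_diag = dist[~np.eye(len(maj_idx), dtype=bool)]
    d_c = max(np.quantile(off_diag, cutoff_quantile), 1e-12)

    # Gaussian-kernel local density; subtract the self-term exp(0) = 1.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # Keep the highest-density majority samples, one per minority sample.
    n_keep = min(len(min_idx), len(maj_idx))
    keep_maj = maj_idx[np.argsort(-rho, kind="stable")[:n_keep]]

    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]
```

Calling `density_undersample(X, y, majority_label=0)` returns a balanced subset in which isolated majority points, which have low local density, are the first to be dropped. The resulting subset can then be passed to any classifier, such as the C4.5, 3-NN, or naive Bayes algorithms named in the abstract.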

Keywords/Search Tags:

Fault detection, Undersampling, Oversampling, Classification, Imbalanced data

Related items

1	Research Of Imbalanced Datasets Preprocessing Combined With Clustering
2	Research On Classification Algorithm For Imbalanced Data
3	Research On Imbalanced Data Undersampling Classification Based On Constructive Covering
4	Research On Neighborhood-aware Imbalanced Data Sampling Classification
5	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm
6	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
7	Research On Under-sampling Classification Method Of Unbalanced Data
8	Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling
9	Application And Research Of Optimization Method For Imbalanced Data