
Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application

Posted on: 2021-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
Full Text: PDF
GTID: 2428330626955300
Subject: Computer technology
Abstract/Summary:
Imbalanced data classification is an important research direction in machine learning and pattern recognition, with wide application value in fraud detection, medical diagnosis, and other fields. In an imbalanced data set the class distribution is skewed: majority-class samples outnumber minority-class samples, yet the minority-class samples are often the more valuable for research and therefore deserve particular attention. Traditional classification methods do not handle this situation well, which has made the classification of imbalanced data a research hotspot. This thesis studies a density-based under-sampling method for imbalanced data sets, applies the resulting classification approach to fault detection, and implements a fault detection system based on IIS logs. The main work of the thesis is as follows:

1) US-DP, a density-based under-sampling method. The method clusters the majority-class samples by density, sorts them by density peak value, selects the samples with the higher density peak values, combines them with the minority-class samples to form a new sample set, and builds a classification model on the resulting data set. Because it follows the density and sparseness of the data distribution, the method tends to select cluster centers in densely populated regions, which reduces the influence of noise points. Experimental verification shows that the method is effective for imbalanced data classification.

2) A fault detection system based on IIS logs, implemented with JSP + Servlet + JDBC. The system is divided into four functional modules: user login, data preprocessing, data analysis, and result visualization. The system first processes the log data to transform its attributes and formats; it then applies sampling methods (random under-sampling, K-means, Tomek links, US-DP) to the log data and classifies it with classification algorithms (C4.5, 3-NN, naive Bayes).
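
The selection step that the abstract attributes to US-DP can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not define the "density peak value", so the sketch assumes the usual density-peaks quantities (local density `rho`, distance to the nearest denser point `delta`, and their product `gamma` as the ranking score). The function names, the Gaussian kernel, the cutoff distance `d_c`, the default of keeping as many majority samples as there are minority samples, and the labels 0 and 1 are all assumptions made for illustration.

```python
import math


def density_peak_scores(points, d_c):
    """Return (rho, delta, gamma) for each point.

    rho   : local density (Gaussian kernel with cutoff d_c, an assumed choice)
    delta : distance to the nearest point with strictly higher density
    gamma : rho * delta, used here as the assumed "density peak value"
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


def undersample_majority(majority, minority, d_c=1.0, keep=None):
    """Keep the `keep` majority samples with the highest gamma score.

    Returns (samples, labels); majority is labelled 0 and minority 1
    (hypothetical labels chosen for this sketch).
    """
    if keep is None:
        keep = len(minority)  # assumed target: a balanced sample set
    _, _, gamma = density_peak_scores(majority, d_c)
    ranked = sorted(range(len(majority)), key=lambda i: gamma[i], reverse=True)
    kept = [majority[i] for i in ranked[:keep]]
    samples = kept + list(minority)
    labels = [0] * len(kept) + [1] * len(minority)
    return samples, labels
```

Calling `undersample_majority(majority, minority, d_c=0.5)` on lists of coordinate tuples returns a reduced sample set that any of the classifiers named above could then be trained on. Under these assumptions, an isolated majority point has a very small `rho` and is ranked last, which is one way to read the abstract's claim that the method reduces the impact of noise points.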
Keywords/Search Tags:Fault detection, Undersampling, Oversampling, Classification, Imbalanced data