
Research On Imbalanced Data Classification Based On The Distribution Of Near Neighbors

Posted on: 2020-09-02 | Degree: Master | Type: Thesis
Country: China | Candidate: C W Wang | Full Text: PDF
GTID: 2428330602952276 | Subject: Applied Mathematics
Abstract/Summary:
Classification is an important research topic in machine learning and data mining. Traditional classification methods include decision trees, k-nearest-neighbor classification, rule-based classification, neural networks, support vector machines, naive Bayes, and so on. These methods perform well when the classes in a data set are of roughly equal size, but their performance degrades when the class sizes are imbalanced. In practical applications there are many data sets in which the class sizes are unequal; these are called imbalanced data sets. In an imbalanced data set the majority classes overwhelm the classification process, so the decisions of the classifier are dominated by those classes, which harms its performance on instances of the minority classes. To improve the sensitivity of traditional algorithms to minority-class instances, this thesis studies the classification of imbalanced data based on k nearest neighbors, from the perspective of the distribution of near neighbors. The main work is as follows:

Because the distribution of imbalanced data is complex, the traditional k-nearest-neighbor method cannot accurately capture the distribution of the minority neighbors, which reduces its classification performance. To address this problem, we propose the nearest neighbor with double neighborhoods classification algorithm (NNDN). The sparsity of the minority instances in the query neighborhood is determined by a double-neighborhood scheme; a tendency-weighting mechanism then assigns larger weights to minority instances that are easily misjudged, which increases the sensitivity of the algorithm to the minority instances; finally, the classification is completed by weighted voting. Comparison experiments between NNDN and related algorithms on 40 real data sets demonstrate the validity and applicability of the algorithm: NNDN is well suited to imbalanced data classification and outperforms the comparison algorithms on AUC, recall, accuracy, and other metrics.

When dealing with imbalanced data, the traditional k-nearest-neighbor classifier has low sensitivity to minority instances and cannot distinguish two instances that lie at the same distance from the query instance, which leads to poor classification performance. To improve the applicability of the k-nearest-neighbor classifier, a density-based nearest neighbor classification algorithm (DNN) is proposed. First, kernel density estimation is used to estimate the density of the query instance under each class, thereby localizing the query instance by density. Second, the points in the original data space are mapped into a space composed of class-density and distance information. Last, neighbors are dynamically selected in this mapped space; that is, instances whose density environment and distance are similar to those of the query instance are found and used for classification. The improved nearest-neighbor algorithm successfully distinguishes two instances at the same distance from the query instance and keenly captures the local distribution characteristics of imbalanced data, thus improving on the traditional k-nearest-neighbor classifier. Experiments on 15 real-world imbalanced data sets show that DNN performs well on imbalanced data classification.
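The double-neighborhood idea behind NNDN can be sketched as follows. This is a minimal illustrative implementation, not the thesis's exact formulation: the function name `nndn_predict`, the weight formula `1 + alpha * sparsity`, and the use of a second k-neighborhood around each minority neighbor to measure local sparsity are all assumptions made for the sketch.

```python
import numpy as np

def nndn_predict(X, y, query, k=5, minority_label=1, alpha=1.0):
    """Sketch of a double-neighborhood weighted k-NN vote.

    First neighborhood: the k nearest neighbors of the query.
    Second neighborhood: the k nearest neighbors of each minority
    neighbor, used to estimate how sparse the minority class is
    around that neighbor.  Sparser minority neighbors (those easy
    to misjudge) receive a larger voting weight.
    """
    d = np.linalg.norm(X - query, axis=1)
    nn = np.argsort(d)[:k]                       # first neighborhood

    votes = {}
    for i in nn:
        w = 1.0
        if y[i] == minority_label:
            # second neighborhood, centered on the minority neighbor
            d2 = np.linalg.norm(X - X[i], axis=1)
            nn2 = np.argsort(d2)[1:k + 1]        # skip the point itself
            minority_frac = np.mean(y[nn2] == minority_label)
            sparsity = 1.0 - minority_frac       # near 1 when isolated
            w = 1.0 + alpha * sparsity           # tendency weight
        votes[y[i]] = votes.get(y[i], 0.0) + w

    # weighted voting rule: class with the largest total weight wins
    return max(votes, key=votes.get)
```

On a toy data set with a dense majority cluster and a sparse minority cluster, the sparsity-based weight boosts the minority votes, so a query near the minority cluster is labeled as minority even though each individual vote would carry weight 1 under plain k-NN.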
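The DNN procedure (density localization, mapping to a density-distance space, dynamic neighbor selection) can likewise be sketched in Python. The Gaussian kernel, the fixed bandwidth `h`, and the normalized `hypot` score used to combine density and distance are assumptions of this sketch; the thesis does not specify these exact choices here.

```python
import numpy as np
from collections import Counter

def gaussian_kde_density(points, x, h=0.5):
    """Gaussian kernel density estimate of x from the sample points."""
    if len(points) == 0:
        return 0.0
    d2 = np.sum((points - x) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * h ** 2))))

def dnn_predict(X, y, query, k=5, h=0.5):
    """Sketch of density-based nearest neighbors (DNN).

    Step 1: estimate the query's density under each class and keep
            the largest as its density coordinate (localization).
    Step 2: map every training point to the pair
            (own-class density, distance to the query).
    Step 3: select neighbors in that mapped space, so two points at
            the same distance from the query can still be told apart
            by their density environments.
    """
    classes = np.unique(y)
    q_dens = max(gaussian_kde_density(X[y == c], query, h) for c in classes)

    dens = np.array([gaussian_kde_density(X[y == y[i]], X[i], h)
                     for i in range(len(X))])
    dist = np.linalg.norm(X - query, axis=1)

    # similarity in the mapped space: close in distance AND in density
    score = np.hypot(dist / (dist.max() + 1e-12),
                     (dens - q_dens) / (np.abs(dens - q_dens).max() + 1e-12))
    nn = np.argsort(score)[:k]
    return Counter(y[nn]).most_common(1)[0][0]
```

Normalizing both coordinates before combining them keeps the distance term from dwarfing the density term, which is one simple way to realize the "similar density environment and similar distance" criterion described above.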
Keywords/Search Tags:k nearest neighbors, Imbalanced data, Classification, Sample distribution, Estimation of density