Imbalanced Data Classification Algorithm Based On Unsupervised Intelligent Under Sampling Method

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Luo
GTID: 2428330596986790
Subject: Applied statistics
Abstract/Summary:
The rapid development of information technology has ushered in an era of mass production, sharing, and application of data, and machine learning is the cornerstone of extracting value from that data. Classification is one of the most important problems in this field. General-purpose classification algorithms implicitly assume that the classes contain comparable numbers of instances and that the costs of misclassification are comparable. In practice, however, many data sets are highly imbalanced: one class contains far more samples than the others, which makes it difficult for general classification methods to achieve good results.

To improve classification performance on imbalanced data, researchers in China and abroad have carried out a great deal of related work. Existing studies fall roughly into three groups. First, data reconstruction before model building, mainly using resampling techniques such as under-sampling and over-sampling to reduce the degree of imbalance between classes. Second, modifying the classification algorithm to suit imbalanced data sets, for example by assigning different weights to samples of different classes or by introducing perturbations into samples of multiple classes. Third, combining the first two approaches.

Addressing the particular characteristics of imbalanced data sets, this thesis proposes a new intelligent under-sampling method based on unsupervised learning and combines it with an ensemble learning algorithm to better solve the imbalanced classification problem. The main work of this thesis is as follows:

1. Inquiry and analysis: The reasons why traditional classification algorithms fail on imbalanced data are analyzed, and the principles and ideas behind existing methods and techniques are examined to identify problems that remain open.
2. Data reconstruction: Inspired by grey system theory, a new under-sampling method is proposed to address the problems of earlier resampling techniques. It uses KNN to discover the internal structure of the samples, repeatedly eliminating redundant samples and retaining representative ones until all classes contain the same number of samples.
3. Algorithm integration: The characteristics and performance of commonly used classification methods are compared and analyzed, and Bagging is integrated with the SVM classification algorithm to classify the reconstructed data.
4. Multi-class classification: Common strategies for handling multi-class classification problems are studied, and the proposed method is extended to multi-class imbalanced data sets.
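
The abstract does not give the details of the under-sampling procedure, so the following is only a minimal sketch of the general idea of KNN-guided under-sampling, not the thesis's actual method. It assumes that a majority-class sample is "redundant" when its k nearest majority-class neighbours are very close to it, and it removes the most crowded sample one at a time until the majority class is as small as the minority class. The function name `knn_density_undersample`, the mean-distance redundancy score, and the default `k` are all hypothetical choices made for illustration.

```python
import math


def knn_density_undersample(majority, minority_size, k=3):
    """Shrink `majority` to `minority_size` points.

    Assumption (not from the thesis): the point whose k nearest
    majority-class neighbours are closest on average is treated as the
    most redundant one and is removed first.
    """
    if minority_size < 1:
        raise ValueError("minority_size must be at least 1")

    kept = list(majority)
    while len(kept) > minority_size:
        best_idx, best_score = None, math.inf
        for i, p in enumerate(kept):
            # Distances from p to every other remaining majority point.
            dists = sorted(
                math.dist(p, q) for j, q in enumerate(kept) if j != i
            )
            neighbours = dists[:k]
            # Small mean distance = crowded region = redundant sample.
            score = sum(neighbours) / len(neighbours)
            if score < best_score:
                best_idx, best_score = i, score
        kept.pop(best_idx)
    return kept


# Toy data: three majority points crowd the origin, two are isolated.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 1.0)]
minority = [(4.0, 4.5), (4.2, 4.4), (3.9, 4.8)]

balanced_majority = knn_density_undersample(majority, len(minority), k=2)
print(len(balanced_majority), len(minority))
```

In this toy run the isolated points are kept and the crowded cluster near the origin is thinned, so both classes end up with three samples. Recomputing every distance in each round is quadratic per removal, which is acceptable for a sketch but would need a spatial index or cached neighbour lists on real data. The balanced set would then be passed to the classifier stage, which the abstract describes as Bagging combined with SVM.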
Keywords/Search Tags: machine learning, classification, imbalanced data, sampling technology, unsupervised learning, ensemble learning