
The Classification Algorithm Research Based On Imbalanced Data

Posted on: 2017-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: L W Zhang
Full Text: PDF
GTID: 2308330485489363
Subject: Computer technology
Abstract/Summary:
Data classification is an important task in data mining, and scholars in China and abroad have studied it extensively. However, traditional methods assume balanced data. When they are applied in fields such as medical diagnosis and anomaly detection, where the class distribution is imbalanced, they produce a high false negative rate for the minority class. This thesis therefore studies classification methods for imbalanced data. The work has two aspects: a study of traditional classification algorithms and of existing imbalanced data classification methods that address their defects; and a brief introduction to the limitations of the DGC and IDGC models, followed by a proposed improved classification model, GIDGC-KNN, which is evaluated experimentally.

(1) Research on the basic algorithms. First, traditional classification algorithms such as SVM, KNN, decision trees and AdaBoost are studied. Second, imbalanced classification is studied at the data level and from the cost-sensitive, one-class learning and ensemble learning perspectives, covering methods such as SMOTE, weighted SVM, One Class SVM, SSLM and SMOTEBoost.

(2) Research on GIDGC-KNN, a classification model based on local correlation and geodesic distance, derived from DGC and IDGC. First, DGC and IDGC are analyzed in terms of data gravitation, feature weight selection, data particle creation and classification principles. Because these two models ignore the distribution properties of the data and the correlation between a test sample and its neighboring classes, which lowers accuracy, the GIDGC-KNN model is proposed. The model takes the AGC (Amplified Gravitation Coefficient) from IDGC and combines it with geodesic distance and the KNN algorithm. Its data particle creation process uses the MNP (Maximum Neighbor Principle) instead of the MDP (Maximum Distance Principle) used by IDGC. To a certain extent, MNP preserves the distribution characteristics and local correlation of the original data.

(3) Experimental verification. The experimental data are 22 two-class imbalanced datasets from the KEEL dataset repository, with AUC and GM as the classification performance measures. The GIDGC-KNN classification model is compared with traditional sampling techniques and with boosting and cost-sensitive methods. Experimental results show that the model performs clearly well in these comparisons.
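The two evaluation measures named in (3) can be illustrated with a minimal Python sketch. It assumes that GM denotes the usual geometric mean of sensitivity and specificity, and that AUC is computed as the probability that a randomly chosen minority sample receives a higher score than a randomly chosen majority sample. The function names `g_mean` and `auc` and the toy labels are hypothetical and are not taken from the thesis.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


def auc(y_true, scores, positive=1):
    """AUC via pairwise comparison of positive and negative scores.

    Ties count as half a win. This is O(n_pos * n_neg), which is fine
    for a small illustration but slow for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy imbalanced labels (hypothetical): 2 minority samples, 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class is 80% accurate,
# yet its G-mean is 0 because it misses every minority sample.
always_majority = [0] * len(y_true)
print(g_mean(y_true, always_majority))
```

The always-majority example shows why plain accuracy is a poor measure on imbalanced data: it rewards ignoring the minority class, while GM drops to zero whenever either class is missed entirely.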
Keywords/Search Tags: Data mining, classification, imbalanced data, geodesic distance, K-nearest neighbors, data gravitation