The Classification Algorithm Research Based On Imbalanced Data

Posted on:2017-04-04

Degree:Master

Type:Thesis

Country:China

Candidate:L W Zhang

Full Text:PDF

GTID:2308330485489363

Subject:Computer technology

Abstract/Summary:

Data classification is an important task in data mining, and scholars in China and abroad have studied it extensively. However, traditional classification methods assume a balanced class distribution. When they are applied in fields such as medical diagnosis and anomaly detection, where the class distribution is imbalanced, they produce a high false negative rate for the minority class. This thesis therefore studies classification methods for imbalanced data. The research has two aspects: a study of traditional classification algorithms and of the existing imbalanced-data classification methods that address their defects; and a brief introduction to the limitations of the DGC and IDGC models, followed by a proposed improved classification model, GIDGC-KNN, which is evaluated experimentally.

(1) Research on the basic algorithms. First, traditional classification algorithms such as SVM, KNN, decision trees, and AdaBoost are studied. Second, imbalanced-data classification methods at the data level, the cost-sensitive level, the one-class learning level, and the ensemble learning level are studied, such as SMOTE, weighted SVM, One-Class SVM, SSLM, and SMOTEBoost.

(2) Research on GIDGC-KNN, a classification model based on local correlation and geodesic distance and derived from DGC and IDGC. First, DGC and IDGC are analyzed in terms of data gravitation, feature weight selection, data particle creation, and classification principle. Because both models ignore the distribution of the data and the correlation between a test sample and its neighboring classes, which lowers accuracy, the GIDGC-KNN model is proposed. The model takes the AGC (Amplified Gravitation Coefficient) from IDGC and combines it with geodesic distance and the KNN algorithm. Its data particle creation process uses the MNP (Maximum Neighbor Principle) in place of the MDP (Maximum Distance Principle) used by IDGC. To a certain extent, MNP preserves the distribution characteristics and local correlation of the original data.

(3) Experimental verification. The experimental data are 22 two-class imbalanced datasets from the KEEL dataset repository, and AUC and GM are used as the classification performance measures. The GIDGC-KNN classification model is compared with traditional sampling techniques, boosting methods, and cost-sensitive methods. The experimental results show that the model has a clear advantage in classification performance.
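
The abstract names AUC and GM as its evaluation measures but does not give their formulas. The following is a minimal sketch that assumes the standard definitions: GM as the geometric mean of sensitivity and specificity, and AUC as the probability that a randomly chosen minority sample is scored above a randomly chosen majority sample. The function names `g_mean` and `auc` and the toy labels are illustrative assumptions, not taken from the thesis.

```python
from math import sqrt


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed GM definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 minority samples (label 1), 8 majority samples (label 0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class is 80% accurate,
# yet its GM is 0 because it never detects the minority class.
all_majority = [0] * 10
gm_trivial = g_mean(y_true, all_majority)

# A classifier that finds one of the two minority samples, with one false alarm.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
gm_model = g_mean(y_true, y_pred)

# Hypothetical continuous scores for the same samples.
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
auc_model = auc(y_true, scores)
```

The always-majority example illustrates why plain accuracy is a poor measure on imbalanced data and why measures such as GM and AUC, which account for the minority class, are used instead.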

Keywords/Search Tags:

Data mining, classification, unbalanced data, geodesic distance, K-neighbors, gravity data

Related items

1	Research On SVM Classification Of Unbalanced Data And Its Application In Identify Poor Students In Colleges And Universities
2	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
3	The Research On Clustering And Classification Algorithm Data Mining And Its Application In Aluminum Data Analysis
4	Research On Classification Algorithms For Unbalanced Data
5	Research On Unbalanced Text Data Set Classification Algorithm
6	Research And Application Of Integrated Algorithms For Unbalanced Data Sets
7	Multi-threshold Based Contrast Pattern Mining And Its Application In Classification Of Imbalanced Datasets
8	Research On Anomaly Detection And Classification Of Labeled Data Based On Data Density
9	Unbalanced Data Classification Algorithm Based On SVM For Research And Application
10	Unbalanced Data Classification Under-sampling Algorithm Based On SVM For Research And Application