
Research On Classification Method For Imbalanced Data Sets And Its Application

Posted on: 2021-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: J S Wei
Full Text: PDF
GTID: 2428330611466939
Subject: Computer Science and Technology
Abstract/Summary:
Classification is an important data mining method. Classical classification algorithms usually pursue overall accuracy on the premise that the data samples are roughly balanced in distribution. In practice, however, imbalanced datasets are common, and the minority samples in such datasets are usually the ones of greater concern. Affected by the imbalanced sample distribution, traditional classification algorithms are usually biased toward the majority class and cannot effectively avoid interference from noisy data, so they are not well suited to imbalanced data classification. It is therefore necessary to study classification methods for imbalanced data further.

This paper analyzes and improves classification for imbalanced data from two directions, noise filtering and ensemble learning, and proposes a novel noise-filtered ensemble model (TWK-LGEE), which consists of a noise filtering algorithm (TWK) and an ensemble learning model (LGEE). The specific contributions of this paper are:

1) The shortcomings of traditional noise filtering methods are analyzed, and a noise filtering method (TWK) that combines Tomek-Link with feature-weighted KNN is proposed. TWK combines two types of positional relationship judgments and introduces feature weights based on the F-test, so that it can effectively filter noisy data in both the majority and the minority samples. The scarcity of minority samples is taken into account: by selecting an appropriate threshold, false elimination of samples is avoided, which protects the valuable information in the minority class.

2) An ensemble model in the EasyEnsemble style is constructed with LightGBM as the base classifier, which effectively improves the overall classification efficiency of the ensemble model.

3) The traditional EasyEnsemble framework is improved: its sampling method is adjusted according to the imbalance ratio of the dataset. When the distribution of the dataset is extremely imbalanced, the minority samples are oversampled with Borderline-SMOTE, which guarantees the quality of each sample subset while pushing the class distribution toward balance.

Extensive experiments were carried out on the dataset of the 2019 Kaggle credit card fraud detection competition, covering the noise filtering method (TWK), the ensemble classification model (LGEE), and the overall model (TWK-LGEE) proposed in this paper. The experimental results show that TWK improves F1 and G-mean by 3.94% and 3.98%, respectively, over the best comparison method. The imbalanced ensemble classification model LGEE improves F1 and G-mean by 15.28% and 14.73%, respectively, over the best-performing comparison model, and shortens the running time by 77.02 s. The combination comparison experiments show that TWK-LGEE achieves the best classification performance among the nine model combinations.
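To illustrate the feature-weighted KNN component of the TWK method described above, the following is a minimal Python sketch, not the thesis's actual implementation. It computes per-feature weights from the F-statistic (between-class over within-class variance) and flags a sample as noise when most of its weighted nearest neighbours carry a different label, applying a stricter threshold to minority samples so that scarce minority information is protected. The function names, the neighbour-disagreement rule, and the threshold scheme are all illustrative assumptions; the Tomek-Link stage and the LGEE ensemble are omitted.

```python
import numpy as np

def f_test_weights(X, y):
    """Per-feature F-statistic (between-class / within-class variance),
    normalized to sum to 1, used as distance weights."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    ssb = np.zeros(X.shape[1])  # between-class sum of squares
    ssw = np.zeros(X.shape[1])  # within-class sum of squares
    for c in classes:
        Xc = X[y == c]
        ssb += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        ssw += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_b = len(classes) - 1
    df_w = len(y) - len(classes)
    f = (ssb / df_b) / (ssw / df_w + 1e-12)
    return f / f.sum()

def weighted_knn_noise_filter(X, y, k=3, minority_label=1,
                              minority_threshold=1.0):
    """Return indices of samples flagged as noise.

    A majority sample is flagged when more than half of its k
    weighted-nearest neighbours disagree with its label; a minority
    sample is flagged only when the disagreement fraction reaches
    `minority_threshold` (1.0 => ALL neighbours disagree), which
    guards against falsely eliminating valuable minority samples."""
    w = f_test_weights(X, y)
    # Pairwise weighted Euclidean distances (small-data sketch only).
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(((diffs ** 2) * w).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    noise = []
    for i in range(len(y)):
        nn = np.argsort(dist[i])[:k]
        disagree = np.mean(y[nn] != y[i])
        if y[i] == minority_label:
            if disagree >= minority_threshold:  # stricter for minority
                noise.append(i)
        elif disagree > 0.5:
            noise.append(i)
    return np.array(noise, dtype=int)

# Usage: a majority point planted inside the minority cluster is flagged.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [5.05, 5.05]])           # index 7: mislabelled majority point
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])
print(weighted_knn_noise_filter(X, y, k=3))
```

In this sketch the minority threshold plays the role the abstract attributes to TWK's threshold selection: clean minority points with one or two majority neighbours survive, while only the planted outlier is removed.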
Keywords/Search Tags:Imbalanced Data Classification, Noise Filtering, Ensemble Learning, LightGBM, EasyEnsemble