
Research On Methods For Imbalanced Data Classification

Posted on: 2016-04-27    Degree: Master    Type: Thesis
Country: China    Candidate: K H Sun    Full Text: PDF
GTID: 2348330488957241    Subject: Engineering
Abstract/Summary:
Classification is an important research topic in the field of data mining. Most traditional classification algorithms assume that the dataset is evenly distributed among all classes; in practice, however, imbalanced datasets are widespread, so research on imbalanced data classification has significant value in many fields. Traditional classification algorithms aim to maximize overall accuracy, which biases the classification results toward the majority class. The methods proposed to address imbalanced data classification can be divided into the data level, the algorithm level, and the feature level. This thesis studies classification methods for imbalanced data in depth; the main work and research results are as follows:

Firstly, four traditional methods for imbalanced data classification are introduced and simulated: random over-sampling, random under-sampling, the neighbor-weighted K-nearest neighbor algorithm, and an imbalanced feature selection algorithm based on random forest. The advantages and disadvantages of these methods are analyzed. In addition, evaluation criteria for imbalanced data classification are introduced, which provide an objective basis for assessing algorithm performance.

Secondly, an imbalanced data classification method based on the local mean is proposed, addressing the problem that the local mean classifier is biased toward the majority class when applied to imbalanced data. The method distinguishes minority samples from majority samples and computes local means from different numbers of samples for each class. To address the local mean classifier's neglect of the global information in the dataset, the method replaces the original single distance with the cumulative distance from the test sample to each class, and then determines the label of the test sample by comparing the cumulative distances of the classes. Simulation results show that the method effectively improves the classification accuracy of the minority class and exhibits strong stability across different datasets.

Finally, an imbalanced feature selection and classification method based on improved RELIEF-F and ensemble learning is proposed, addressing the problem that the RELIEF-F algorithm cannot effectively select the key features that distinguish minority samples from majority samples. The method samples the majority class to construct multiple balanced training subsets in the manner of the Bagging algorithm, computes feature weights on each training subset, ensembles the weights, and selects the features whose weights exceed a threshold. The test samples are then classified using the selected features. Simulation results show that the method improves both the effect of feature selection and the classification performance.
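The following is a minimal sketch of the local-mean idea described above, assuming Euclidean distance, a per-class neighbor count supplied as a dictionary (e.g. a larger count for the majority class), and a cumulative distance formed by summing the distances to the local means built from the 1, 2, ..., k nearest neighbors of each class; the exact parameterization in the thesis may differ, and all names here are illustrative.

```python
import numpy as np

def local_mean_imbalanced_predict(X_train, y_train, x_test, k_per_class):
    """Assign x_test to the class with the smallest cumulative distance to its
    local mean vectors. k_per_class, e.g. {0: 9, 1: 3}, gives the (assumed)
    number of local neighbors used for each class label."""
    best_label, best_score = None, np.inf
    for label, k in k_per_class.items():
        X_c = X_train[y_train == label]
        # distances from the test sample to every training sample of this class
        dists = np.linalg.norm(X_c - x_test, axis=1)
        neighbours = X_c[np.argsort(dists)[:k]]
        # cumulative distance: sum of distances to the local means built from
        # the 1, 2, ..., k nearest neighbors of this class
        score = sum(
            np.linalg.norm(neighbours[:i].mean(axis=0) - x_test)
            for i in range(1, len(neighbours) + 1)
        )
        if score < best_score:
            best_label, best_score = label, score
    return best_label
```

Using fewer local neighbors for the minority class and accumulating distances over several local means is one way to keep a single distant majority neighborhood from dominating the decision, which matches the bias the thesis aims to correct.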
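Below is a sketch of the Bagging-style feature selection step described in the final contribution, under stated assumptions: the majority class is undersampled without replacement to the minority-class size, the per-subset RELIEF-F weights are ensembled by averaging, and `relieff_weights(X, y)` is a hypothetical caller-supplied function returning one weight per feature (any RELIEF-F implementation could fill that role). The function name, defaults, and thresholding rule are illustrative, not the thesis's exact procedure.

```python
import numpy as np

def ensemble_relieff_selection(X, y, minority_label, relieff_weights,
                               n_subsets=10, threshold=0.01, seed=0):
    """Return indices of features whose averaged RELIEF-F weight, computed over
    several balanced subsets, exceeds the threshold."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    all_weights = []
    for _ in range(n_subsets):
        # undersample the majority class so each training subset is balanced
        sampled = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sampled])
        all_weights.append(relieff_weights(X[idx], y[idx]))
    # ensemble the per-subset weights and keep features above the threshold
    mean_weights = np.mean(all_weights, axis=0)
    return np.where(mean_weights > threshold)[0]
```

A classifier would then be trained and tested on `X[:, selected]`, where `selected` is the returned index array.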
Keywords/Search Tags: Imbalanced Classification, Local Mean, Ensemble Learning, RELIEF-F, Imbalanced Feature Selection