
Research On Methods For Imbalanced Data Classification

Posted on: 2016-04-27    Degree: Master    Type: Thesis
Country: China    Candidate: K H Sun    Full Text: PDF
GTID: 2348330488957241    Subject: Engineering
Abstract/Summary:
Classification is an important research topic in the field of data mining. Most traditional classification algorithms assume that the dataset is evenly distributed among all classes; in practice, however, imbalanced datasets are widespread, so research on imbalanced data classification has significant value in many fields. Traditional classification algorithms aim to maximize overall accuracy, which biases the classification results toward the majority class. The methods proposed to address imbalanced data classification can be divided into the data level, the algorithm level, and the feature level. This thesis studies classification methods for imbalanced data in depth; the main work and research results are as follows:

Firstly, four traditional methods for imbalanced data classification are introduced and simulated: random over-sampling, random under-sampling, the neighbor-weighted K-nearest neighbor algorithm, and an imbalanced feature selection algorithm based on random forest. The advantages and disadvantages of these methods are analyzed. In addition, evaluation criteria for imbalanced data classification are introduced, which provide an objective basis for assessing algorithm performance.

Secondly, an imbalanced data classification method based on the local mean is proposed, addressing the problem that the local mean classifier is biased toward the majority class when applied to imbalanced data. The method distinguishes minority samples from majority samples and computes local means from different numbers of samples for each class. To address the local mean classifier's neglect of the global information in the dataset, the method replaces the original single distance with the cumulative distance from the test sample to each class, and then determines the label of the test sample by comparing the cumulative distances of the classes. Simulation results show that the method effectively improves the classification accuracy of the minority class and exhibits strong stability across different datasets.

Finally, an imbalanced feature selection and classification method based on improved RELIEF-F and ensemble learning is proposed, addressing the problem that the RELIEF-F algorithm cannot effectively select the key features that distinguish minority samples from majority samples. The method samples the majority class to construct multiple balanced training subsets in the manner of the Bagging algorithm, computes feature weights on each training subset, ensembles the weights, and selects the features whose weights exceed a threshold. The test samples are then classified using the selected features. Simulation results show that the method improves both the effect of feature selection and the classification performance.
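The following is a minimal sketch of the local-mean idea described above, assuming Euclidean distance, a per-class neighbor count supplied as a dictionary (e.g. a larger count for the majority class), and a cumulative distance formed by summing the distances to the local means built from the 1, 2, ..., k nearest neighbors of each class; the exact parameterization in the thesis may differ, and all names here are illustrative.

```python
import numpy as np

def local_mean_imbalanced_predict(X_train, y_train, x_test, k_per_class):
    """Assign x_test to the class with the smallest cumulative distance to its
    local mean vectors. k_per_class, e.g. {0: 9, 1: 3}, gives the (assumed)
    number of local neighbors used for each class label."""
    best_label, best_score = None, np.inf
    for label, k in k_per_class.items():
        X_c = X_train[y_train == label]
        # distances from the test sample to every training sample of this class
        dists = np.linalg.norm(X_c - x_test, axis=1)
        neighbours = X_c[np.argsort(dists)[:k]]
        # cumulative distance: sum of distances to the local means built from
        # the 1, 2, ..., k nearest neighbors of this class
        score = sum(
            np.linalg.norm(neighbours[:i].mean(axis=0) - x_test)
            for i in range(1, len(neighbours) + 1)
        )
        if score < best_score:
            best_label, best_score = label, score
    return best_label
```

Using fewer local neighbors for the minority class and accumulating distances over several local means is one way to keep a single distant majority neighborhood from dominating the decision, which matches the bias the thesis aims to correct.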
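Below is a sketch of the Bagging-style feature selection step described in the final contribution, under stated assumptions: the majority class is undersampled without replacement to the minority-class size, the per-subset RELIEF-F weights are ensembled by averaging, and `relieff_weights(X, y)` is a hypothetical caller-supplied function returning one weight per feature (any RELIEF-F implementation could fill that role). The function name, defaults, and thresholding rule are illustrative, not the thesis's exact procedure.

```python
import numpy as np

def ensemble_relieff_selection(X, y, minority_label, relieff_weights,
                               n_subsets=10, threshold=0.01, seed=0):
    """Return indices of features whose averaged RELIEF-F weight, computed over
    several balanced subsets, exceeds the threshold."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    all_weights = []
    for _ in range(n_subsets):
        # undersample the majority class so each training subset is balanced
        sampled = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sampled])
        all_weights.append(relieff_weights(X[idx], y[idx]))
    # ensemble the per-subset weights and keep features above the threshold
    mean_weights = np.mean(all_weights, axis=0)
    return np.where(mean_weights > threshold)[0]
```

A classifier would then be trained and tested on `X[:, selected]`, where `selected` is the returned index array.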
Keywords/Search Tags: Imbalanced Classification, Local Mean, Ensemble Learning, RELIEF-F, Imbalanced Feature Selection