Research On Resampling Methods For Imbalance Data

Posted on: 2019-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: B Q Duan
Full Text: PDF
GTID: 2428330551958754
Subject: Computer software and theory
Abstract/Summary:
Imbalanced data classification is an important problem in machine learning and pattern recognition. In practical applications, the minority class has fewer samples but is often the more significant one. Traditional classifiers, however, tend to optimize overall accuracy. As a result, they are biased toward the majority class and neglect the minority class, which weakens the classifier's ability to recognize minority samples.

In recent years, over-sampling methods have been widely used for imbalanced classification. SMOTE (Synthetic Minority Oversampling Technique), proposed by Chawla, is a classical resampling algorithm. SMOTE alleviates the degree of imbalance to some extent, but it is prone to over-fitting. Under-sampling, which removes majority samples until the majority and minority classes are roughly balanced, is another simple and intuitive resampling approach. However, most under-sampling methods remove majority samples indiscriminately and can easily lose valuable information contained in the majority class. In view of these issues, this thesis studies both over-sampling and under-sampling.

1) DS-SMOTE (Density based Synthetic Minority Oversampling Technique). The DS-SMOTE algorithm identifies sparse samples based on sample density and uses them as seed samples in the sampling process. It then creates synthetic samples between each seed and its neighbors using the SMOTE procedure, until the minority set is equal in size to the majority set. Experimental results show that, compared with other similar methods, DS-SMOTE effectively improves the classification accuracy of the minority class and has certain advantages in dealing with imbalanced data classification problems.

2) Dissimilarity matrix based under-sampling classification method. This method uses the dissimilarity matrix of the samples to divide the relationship between a sample and its neighbors into four cases, selectively removes majority-class samples, and adds a Boosting process to ensure that the samples are fully trained. Experimental results show that this method greatly improves the classification accuracy of the minority class compared with similar algorithms.

This thesis discusses and studies imbalanced data classification methods and proposes two resampling-based algorithms that effectively address the low classification performance on minority classes. However, both algorithms also have limitations, and how to adapt them to real-world imbalanced data sets still needs further study.
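
The seed selection and interpolation steps described for DS-SMOTE in item 1) can be illustrated with a rough sketch. The abstract does not say how density is measured or how sparse samples are thresholded, so everything specific here is an assumption made for illustration: the function name `ds_smote_sketch`, the use of mean distance to the `k` nearest minority neighbors as a sparsity score, and the `sparse_quantile` cutoff. Only the final interpolation step follows the standard SMOTE rule of placing a new point at a random position on the segment between a sample and one of its neighbors. It is not the thesis's implementation.

```python
import numpy as np


def ds_smote_sketch(X_min, n_new, k=5, sparse_quantile=0.5, seed=0):
    """Hypothetical density-guided, SMOTE-style oversampling of a minority set.

    X_min: array of shape (n_samples, n_features) with minority samples only.
    n_new: number of synthetic samples to generate.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of and distances to the k nearest minority neighbors.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Assumed sparsity score: a larger mean neighbor distance means lower density.
    sparsity = nn_dist.mean(axis=1)
    seeds = np.flatnonzero(sparsity >= np.quantile(sparsity, sparse_quantile))

    # SMOTE interpolation: x_new = x_seed + gap * (x_neighbor - x_seed), gap in [0, 1).
    chosen = rng.choice(seeds, size=n_new)
    neighbors = nn_idx[chosen, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[chosen] + gap * (X_min[neighbors] - X_min[chosen])
```

To balance a data set in the way the abstract describes, `n_new` would be set to the number of majority samples minus the number of minority samples, and the returned rows would be appended to the minority class before training.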
Keywords/Search Tags: imbalanced datasets, classification, over-sampling, under-sampling