
Research And Application Of Relief Algorithm Based On Imbalanced Data Set Classification

Posted on: 2020-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y He
GTID: 2428330620951099
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information and computer technology, extracting valuable information from massive data is of practical significance and has received continuous attention from researchers. Handling imbalanced data in data mining, and in particular identifying minority classes, remains a challenging task. The traditional Relief algorithm is a feature selection algorithm designed for two-class classification. This thesis studies how to extend the Relief algorithm to multi-class imbalanced datasets. The main results fall into two parts:

(1) For the problem of high-dimensional imbalanced data classification, a class-imbalance-aware Relief algorithm (imRelief) is proposed for classifying tumors from microarray gene expression data. To correct the bias of the traditional Relief algorithm toward majority classes, and to account for the scattered distribution of minority class samples, imRelief introduces a distance factor formula and changes the way the traditional Relief algorithm selects samples to update feature weights, so that the distinguishing features of minority classes receive higher weight. It is then combined with a classifier to improve the classification accuracy of minority classes. Experimental results on four high-dimensional imbalanced microarray gene expression datasets show that imRelief outperforms the several algorithms it is compared with.

(2) To address the loss of majority class accuracy in imRelief, and the need to further improve minority class accuracy, a class-dependent dynamic algorithm, cdRelief, is proposed. The algorithm does not delete any samples in advance when calculating the feature weights, so no majority class sample information is lost. It first dynamically estimates, for each sample in the training set, a probability P of being used to update the feature weights, and then dynamically selects the samples that update the feature weights according to P. By combining the "one vs one" and "one vs all" strategies, the class-dependent weighting for two-class problems is extended to multi-class problems. cdRelief thus gives higher weight to the strongly distinguishing features of both majority and minority classes. Experimental results on 11 public multi-class imbalanced UCI datasets show that cdRelief outperforms the several algorithms it is compared with.
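For orientation, the traditional two-class Relief baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of one common formulation of classic Relief (nearest hit and nearest miss under a range-scaled Manhattan distance), not the thesis's imRelief or cdRelief: the distance factor formula and the probability P are not specified in the abstract and are not reproduced here. The function name `relief_weights` and its parameters are hypothetical.

```python
import random


def relief_weights(X, y, n_iterations=None, seed=0):
    """Feature weights from classic two-class Relief (illustrative sketch).

    X is a list of numeric feature rows, y the matching class labels.
    Assumption: differences are scaled by each feature's range and the
    nearest neighbours are found with Manhattan distance.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    m = n_iterations or n
    weights = [0.0] * d
    for _ in range(m):
        # Pick a random sample; classic Relief treats all samples alike,
        # which is the source of the majority-class bias noted above.
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda k: dist(X[i], X[k]))
        near_miss = min(misses, key=lambda k: dist(X[i], X[k]))
        # Reward features that differ across classes, penalise features
        # that differ within the same class.
        for j in range(d):
            weights[j] += (
                diff(j, X[i], X[near_miss]) - diff(j, X[i], X[near_hit])
            ) / m
    return weights
```

On a toy dataset where the first feature separates the two classes and the second is noise, the first weight comes out larger than the second. Because samples are drawn uniformly, a majority class contributes most of the updates, which is the behaviour the two proposed variants aim to correct.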
Keywords/Search Tags: Data mining, Class imbalanced data classification, Feature selection, Relief algorithm, Class dependent feature weighting