
Feature Selection of Class-Imbalanced Data Based on Rank Aggregation and Rebalancing

Posted on: 2021-02-20  Degree: Master  Type: Thesis
Country: China  Candidate: M J Zong  Full Text: PDF
GTID: 2428330611459195  Subject: Probability theory and mathematical statistics
Abstract/Summary:
Processing class-imbalanced data has become one of the most active and difficult research topics in machine learning and data mining. Feature selection is a common way to address the high dimensionality of class-imbalanced data; its purpose is to preserve, as far as possible, the features related to the minority class so as to improve classification performance. The complex structure of class-imbalanced data makes subsequent feature selection and classification considerably harder, so feature selection for imbalanced data needs to be studied in order to improve classification accuracy.

Filtering is one of the simplest and most common approaches to feature selection. In this thesis, ten filtering techniques are used to rank the features of the data: t-test, Fisher score, Hellinger distance, the Relief algorithm, the ReliefF algorithm, geometric mean, F-measure, AUCROC, AUCPRC, and R value. Kendall's tau rank correlation is used to test the consistency of the ten rankings. The results show that different filtering methods rank the important features differently, and that no single filtering method can be said to be consistently better than the others. In other words, feature selection based on filtering techniques is unstable, and class imbalance makes the instability worse. Based on these findings, we propose a strategy of rank aggregation and rebalancing. Rank aggregation combines several different ranked lists into a single optimal list, which is then used as the final basis for feature selection and classification.

The main research contents and results are as follows. First, rank aggregation feature selection is analyzed on simulated data with different imbalance ratios and dimensions and on real medical data. The results show that the rank aggregation algorithm resolves the inconsistency among the ranked lists and ensures that important variables are not eliminated. For both balanced and imbalanced data, classification based on rank aggregation feature selection performs better than classification based on a single filtering technique. Second, to eliminate the imbalance, seven oversampling techniques are used to rebalance the class-imbalanced data: SMOTE, ADASYN, ANS, BLSMOTE, DBSMOTE, SLS, and RSLS. From the perspective of data structure, rank aggregation feature selection and classification are carried out on eight data sets. The conclusion is that data dimension, imbalance ratio, and correlation between variables each tend to be inversely related to the classification performance of rank aggregation feature selection, and that rank aggregation feature selection improves the classification of class-imbalanced data. A second rank aggregation method is also proposed to process the ranked lists produced by the ten filtering techniques. The results show that the second rank aggregation outperforms both the first rank aggregation and the rebalanced rank aggregation, weakens the influence of the imbalance ratio on classification, and improves classification accuracy. In addition, applying rebalanced rank aggregation and second rank aggregation feature selection to a medical class-imbalanced data set improved classification accuracy.
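To make the idea of comparing and aggregating filter rankings concrete, the following is a minimal sketch in plain Python. The abstract does not say which rank aggregation algorithm the thesis uses, so a simple Borda count is swapped in here purely for illustration. The three rankings, the feature names `x1` to `x4`, and the function names are all hypothetical. Kendall's tau is computed for rankings without ties.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau between two best-first orderings of the same features (no ties)."""
    rank_a = {f: i for i, f in enumerate(order_a)}
    rank_b = {f: i for i, f in enumerate(order_b)}
    concordant = discordant = 0
    for f, g in combinations(order_a, 2):
        # Positive product: both rankings order the pair the same way.
        product = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def borda_aggregate(rankings):
    """Borda count (an assumed stand-in for the thesis's aggregation method).

    Each list awards n points to its top feature, n - 1 to the next, and so on;
    features are returned by total score, best first, ties broken by name.
    """
    n = len(rankings[0])
    scores = {feature: 0 for feature in rankings[0]}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            scores[feature] += n - position
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings from three filters (best feature first).
t_test = ["x1", "x2", "x3", "x4"]
fisher = ["x2", "x1", "x3", "x4"]
relief = ["x1", "x3", "x2", "x4"]

tau = kendall_tau(t_test, fisher)  # below 1 because the filters disagree on one pair
aggregated = borda_aggregate([t_test, fisher, relief])
top_two = aggregated[:2]  # features kept if two features are selected
```

A tau value below 1 signals the kind of disagreement between filters that the abstract describes, and the aggregated list gives a single ordering from which the top features can be selected.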
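The rebalancing step can be illustrated with a simplified, SMOTE-style interpolation written in plain Python. This is only a sketch of the core idea (a synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours); it is not the thesis's implementation and omits the refinements of ADASYN, BLSMOTE, and the other listed variants. The function name, parameters, and sample points are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + u * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies inside the region already occupied by the minority class. Feature ranking and rank aggregation would then be run on the rebalanced data.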
Keywords/Search Tags:rank aggregation, class-imbalanced data, feature selection, filtering technique, rebalanced