
Research On Classification Of High-Dimensional Unbalanced Data

Posted on: 2023-09-11    Degree: Master    Type: Thesis
Country: China    Candidate: F Zhao    Full Text: PDF
GTID: 2558307094489714    Subject: Applied statistics
Abstract/Summary:
Classification and prediction is one of the most fundamental tasks in machine learning. In real-world applications, data increasingly exhibit high-dimensional imbalance, i.e. many features combined with a high imbalance ratio, as in credit card fraud detection, abnormal traffic detection, and medical research. In these scenarios the minority class is usually the class of interest. Although conventional machine learning algorithms can achieve high overall accuracy, minority samples are hard to classify correctly because of their small number and the interference of many uninformative features, so improving the recognition rate of minority samples in high-dimensional imbalanced data is particularly important. Starting from the perspective of data reconstruction, this thesis analyses commonly used oversampling and feature selection algorithms and proposes corresponding improvements.

To address the problems that the minority samples synthesized by traditional oversampling algorithms are of low quality and that the boundary between the majority and minority classes is easily blurred, this thesis proposes a Borderline-SMOTE oversampling algorithm based on improved spectral clustering. Following the idea of Borderline-SMOTE, the algorithm divides the minority samples into safe, boundary, and noisy samples, and then clusters the boundary samples with the more adaptive spectral clustering. After clustering, the sampling multiplier for each boundary cluster is computed automatically from the number of minority samples it contains. The algorithm is validated by visually comparing the sampling results of different oversampling algorithms on the same dataset and by comparing evaluation metrics on KEEL datasets with different imbalance ratios.

To address the problems that traditional filter-based feature selection algorithms rely on a single criterion and are independent of the learning algorithm, while traditional wrapper-based algorithms are computationally inefficient, this thesis proposes a two-stage hybrid feature selection algorithm based on mRMR and a genetic algorithm. In the first stage, mRMR filters out a feature subset satisfying maximum relevance and minimum redundancy; these features then serve as prior information to guide the genetic algorithm in the second stage to construct a better initial population. The effectiveness of the proposed algorithm is verified by comparing evaluation metrics on KEEL datasets with different numbers of features and by analysing the relationship between the number of genetic-algorithm iterations and the fitness of the best individual.

Finally, to verify the effectiveness of applying the two improved algorithms together to the high-dimensional imbalance problem, the Ionosphere dataset from the UCI repository is used for an empirical analysis. The results show that either improved algorithm alone, or both together, effectively improves the prediction of special structures in the ionosphere, and that the classifier performs best when both improved algorithms are used simultaneously.
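As a rough illustration of the first improvement, the sketch below assumes a NumPy / scikit-learn environment. The safe/boundary/noise split follows the usual Borderline-SMOTE rule (k-NN majority ratio), the boundary points are clustered with sklearn's SpectralClustering, and the per-cluster sampling multiplier is a simple proportional allocation of the overall balancing budget; these are illustrative choices and may differ from the thesis's exact formulas.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import NearestNeighbors

def spectral_borderline_smote(X, y, minority_label=1, k=5, n_clusters=3, random_state=0):
    """Label minority points as safe/boundary/noise via their k-NN majority ratio,
    cluster the boundary points with spectral clustering, then interpolate new
    minority samples cluster by cluster (SMOTE-style)."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_new_total = int((y != minority_label).sum() - len(X_min))  # budget to balance classes

    # 1. Borderline step: fraction of majority points among each minority point's k neighbours.
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    maj_ratio = (y[idx[:, 1:]] != minority_label).mean(axis=1)
    border = X_min[(maj_ratio >= 0.5) & (maj_ratio < 1.0)]      # ratio == 1.0 is treated as noise

    # 2. Cluster the boundary samples with spectral clustering.
    n_clusters = min(n_clusters, len(border))
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="nearest_neighbors",
                                n_neighbors=min(k, len(border) - 1),
                                random_state=random_state).fit_predict(border)

    # 3. Allocate the budget proportionally to cluster size and interpolate new samples.
    new_samples = []
    for c in range(n_clusters):
        cluster = border[labels == c]
        n_new = int(round(n_new_total * len(cluster) / len(border)))
        if len(cluster) < 2 or n_new <= 0:
            continue
        nn_c = NearestNeighbors(n_neighbors=min(k, len(cluster) - 1) + 1).fit(cluster)
        _, nbrs = nn_c.kneighbors(cluster)
        for _ in range(n_new):
            i = rng.integers(len(cluster))
            j = nbrs[i, rng.integers(1, nbrs.shape[1])]          # a random neighbour, not itself
            new_samples.append(cluster[i] + rng.random() * (cluster[j] - cluster[i]))

    if not new_samples:
        return X, y
    X_new = np.vstack([X, np.asarray(new_samples)])
    y_new = np.concatenate([y, np.full(len(new_samples), minority_label)])
    return X_new, y_new
```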
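The second improvement can be sketched in the same spirit. The code below assumes mutual information (scikit-learn's mutual_info_classif) as the relevance term and absolute correlation as a stand-in for redundancy, a decision tree with 3-fold cross-validated F1 as the genetic-algorithm fitness, and a plain bit-mask GA with tournament selection, uniform crossover, and bit-flip mutation. All of these are illustrative assumptions rather than the thesis's exact implementation; the key point shown is seeding the initial population with the mRMR-selected features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def mrmr(X, y, n_select):
    """Stage 1 - greedy mRMR: prefer features with high relevance to y and low
    redundancy (approximated here by absolute correlation) with chosen features."""
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        scores = [relevance[f] - corr[f, selected].mean() for f in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

def ga_select(X, y, seed, pop_size=20, generations=30, p_mut=0.02, random_state=0):
    """Stage 2 - GA over binary feature masks; the initial population is seeded
    with the mRMR features so the search starts from informative subsets."""
    rng = np.random.default_rng(random_state)
    n_feat = X.shape[1]

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = DecisionTreeClassifier(random_state=0)
        return cross_val_score(clf, X[:, mask], y, cv=3, scoring="f1").mean()

    # Initial population: mRMR features always on, others on with 10% probability.
    pop = rng.random((pop_size, n_feat)) < 0.1
    pop[:, seed] = True

    best_mask, best_fit = pop[0].copy(), fitness(pop[0])
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        if fit.max() > best_fit:
            best_fit, best_mask = fit.max(), pop[int(np.argmax(fit))].copy()
        # Tournament selection of parents.
        picks = rng.integers(pop_size, size=(pop_size, 2))
        parents = pop[np.where(fit[picks[:, 0]] >= fit[picks[:, 1]],
                               picks[:, 0], picks[:, 1])]
        # Uniform crossover between consecutive parents, then bit-flip mutation.
        mate = parents[(np.arange(pop_size) + 1) % pop_size]
        children = np.where(rng.random((pop_size, n_feat)) < 0.5, parents, mate)
        pop = children ^ (rng.random((pop_size, n_feat)) < p_mut)
    return np.flatnonzero(best_mask)

# Usage on hypothetical training data: keep the top 20 mRMR features as GA seeds.
# seed = mrmr(X_train, y_train, n_select=20)
# chosen = ga_select(X_train, y_train, seed)
```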
Keywords/Search Tags: High-dimensional imbalanced data, Classification, Oversampling, Feature selection