
Research On High-dimensional Unbalanced Data Classification Algorithm Based On Feature Selection And Ensemble Learning

Posted on: 2022-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Zhang
GTID: 2518306731978069
Subject: Computer technology
Abstract/Summary:
With the rise of big data, massive unstructured data exhibits high-dimensional and unbalanced characteristics that existing algorithms cannot handle effectively. Although some research aims to address high dimensionality or data imbalance, these methods are suited only to specific fields and generalize poorly. Improving model performance and generalization has therefore become a key direction in this field. This thesis takes the characteristics of the data into account and makes improvements at three levels: hybrid sampling, feature selection, and ensemble learning.

(1) To handle the duplicate, abnormal, and redundant interference samples that appear in the data set, an edited-neighborhood noise cleaning algorithm is designed. To address the unbalanced data distribution, an improved hybrid sampling algorithm based on boundary neighborhood partition and K-means clustering is proposed. An improved oversampling algorithm based on boundary neighborhood partition oversamples the minority samples on the boundary, which are the most difficult to classify, and synthesizes new minority samples. An improved under-sampling algorithm based on K-means clustering then under-samples the majority edge samples that lie far from their cluster centers.

(2) The redundancy and correlation of features are analyzed further. By introducing feature complementarity, an FCBF (Fast Correlation-Based Filter) feature selection algorithm based on complementarity is designed. The maximal information coefficient (MIC) and the FCBF algorithm are used to filter out irrelevant and redundant features. Then, according to the level of feature complementarity, the classification performance of a C4.5 classifier is used to evaluate candidate feature subsets and select the optimal one.

(3) After hybrid sampling and feature selection, the preprocessed samples are obtained. To obtain a classification model with high accuracy, strong adaptability, and good robustness, a two-layer Stacking framework is constructed. Support vector machines, decision trees, random forests, and adaptive boosting serve as the classifiers of the base model layer; these learners differ substantially from one another, and each performs well on its own. The XGBoost algorithm, which runs quickly and classifies well, is selected as the meta-model layer classifier.

Finally, experimental results show that the hybrid sampling algorithm greatly improves the recognition rate of minority samples, that classification accuracy improves significantly after feature selection, and that the multi-classifier fusion algorithm based on the two-layer Stacking architecture outperforms a single model in both classification performance and generalization ability.
Keywords/Search Tags: High-dimensional Unbalanced Data, Hybrid Sampling, Feature Selection, Ensemble Learning, Stacking