Rebalancing Strategy And Variable Selection Method Of Class-imbalanced Data

Posted on:2020-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:F Xu

GTID:2370330599455873

Subject:Probability theory and mathematical statistics

Abstract/Summary:

Class-imbalanced data are data whose class distribution is skewed. In recent years, handling class-imbalanced data has become an active topic in statistical research, because many real-world data sets are highly imbalanced. How to improve classification performance on such data is the problem addressed here. Traditional classification algorithms generally assume a balanced data set: they classify balanced or evenly distributed data well but perform poorly on class-imbalanced data. To address this, the thesis studies class-imbalanced data from the perspective of rebalancing, with the aim of improving the recognition rate of minority-class samples and the overall classification performance. In addition, when data are both high-dimensional and class-imbalanced, preprocessing becomes more difficult, and variable selection for high-dimensional class-imbalanced data remains a challenge. For this problem, a robust, subsampling-based variable selection method using sparse regularized logistic regression (SRLRS) is proposed. The thesis has two main parts: handling class-imbalanced data through rebalancing, and selecting variables for class-imbalanced data. The main work and contributions are as follows.

First, preprocessing methods for class-imbalanced data are summarized, and the characteristics and problems of class-imbalanced data under an SVM classifier are analyzed. Variable selection methods for high-dimensional class-imbalanced data are also reviewed, together with their advantages and disadvantages.

Second, based on theoretical and experimental analysis of the rebalancing strategy, the thesis analyzes how typical rebalancing methods improve classification performance, summarizing the resampling techniques and the limitations of the improved algorithms. A simulation study is carried out on class-imbalanced data with few variables and many samples, comparing classification performance before and after rebalancing preprocessing. In the real-data study, the optimal parameters of the fitted model are determined, and several class-imbalanced data sets drawn from metabolomics data are used to compare prediction performance. The results show that rebalancing can improve classifier performance.

Third, a robust variable selection method, sparse regularized logistic regression (SRLRS) based on the precision-recall curve (PRC), is proposed. There are currently relatively few variable selection methods for high-dimensional imbalanced data, and especially few that apply sparse regularization to imbalanced metabolomics data. SRLRS takes the characteristics of class-imbalanced data into account: it uses stratified cross-validation and LHO-LOO in the subsampling step. Simulation and real-data studies show that SRLRS combined with the PRC criterion is well suited to class-imbalanced data.
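
As an illustration of the rebalancing idea described above, the following is a minimal Python sketch of random oversampling, one common resampling technique. The abstract does not say which resampling methods the thesis actually uses, so the technique, the function name random_oversample, and the toy data are all assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class.

    Hypothetical helper for illustration; not the method from the thesis.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class

    X_out, y_out = list(X), list(y)  # keep every original row
    for label, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Add (target - n) duplicates sampled with replacement.
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data (assumed): 8 majority-class rows (label 0), 2 minority-class rows (label 1).
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

In practice such resampling would usually be applied only to the training portion of each split, so that duplicated minority rows do not leak into the data used for evaluation.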

Keywords/Search Tags:

Class-imbalanced data, Rebalancing, Variable selection, Data preprocessing

Related items

1	Variable Selection Methods In Statistical Models For Survival Data
2	Research On Classification Algorithm Of Typical Imbalanced Data Sets
3	Research On 2D Spatial Gene Selection Algorithm Based On Unbalanced Gene Data
4	Variable Selection And Feature Screening In High-dimensional Data
5	Variable Selection Based On Longitudinal Survival Data Model
6	Study Of DNA Microarray Data Of Variable Selection Methods
7	Statistical Inference Of Semiparametric Models With Incomplete Data
8	Robust Estimation And Variable Selection Of Two Kinds Of Semi-parametric Models Under High Dimension Data
9	Statistical Estimation And Variable Selection For Semiparametric Models With Complex Data
10	Research On Several Variable Selection Methods And Their Applications In Longitudinal Data