
Research On Principal Component Lasso Dimension Reduction Algorithm Based On Variable Clustering

Posted on: 2021-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Xu
Full Text: PDF
GTID: 2480306293460414
Subject: Applied Statistics
Abstract/Summary:
Regression models have been studied extensively in statistics, both in theory and in application. To make models built from many variables more interpretable and more precise, sparse methods derived from variable selection have received considerable attention in recent years. In practice, independent variables often have a group structure, and group sparse models such as the group lasso and the sparse group lasso handle this structure well. One cause of group structure is correlation between independent variables. An existing study uses variable clustering to obtain variable groups and then applies the group lasso to select among those groups. Building on this approach, we propose an algorithm for identifying variable group structure and apply the group structure it identifies to the principal component lasso model. We call the result the principal component lasso dimensionality reduction algorithm based on variable clustering. Applied to the linear regression model and the logistic regression model, the algorithm yields the VPLasso model and the VPLasso-Logistic model, respectively. The specific work is as follows.

First, starting from the high correlation between quantitative variables, we define a similarity coefficient and a distance for quantitative variables, and we use hierarchical clustering to group the independent variables. We use the average of the adjusted Rand index to determine the number of groups. Together, these steps form the algorithm for identifying group structure. Data simulation results then verify that the algorithm can recognize the variable groups.

Second, under the linear model, we introduce two representative sparse methods: the group lasso model and the sparse group lasso model. Taking the identification of variable group structure as the starting point, and considering the sparsity of variables both within groups and between groups, we propose the principal component lasso dimensionality reduction algorithm based on variable clustering. Numerical simulations examine two cases: variables grouped evenly and variables grouped unevenly. We compare the group lasso model, the sparse group lasso model, and the VPLasso model on six indicators: model prediction accuracy, overall parameter estimation accuracy, total number of variables selected, variable selection sensitivity, variable selection specificity, and accuracy of important variable group selection. The results show that the VPLasso model can select the important variables and variable groups while keeping the root mean square errors of the dependent variable and of the regression parameters relatively small. Examples also show that the VPLasso model is strongly sparse.

Finally, we apply the dimensionality reduction algorithm to the logistic regression model and call the result the VPLasso-Logistic model. As before, we use numerical simulations to analyze the results of three models on six indicators. The three models are the group Lasso-Logistic model, the sparse group Lasso-Logistic model, and the VPLasso-Logistic model. The indicators are average positive coverage (AR), average positive hit (AP), the re-average (APR) of the harmonic mean of positive coverage and hit rate, and four other unchanged indicators. We then demonstrate the models on the colon data set (colon cancer data). The simulation and case analysis results show that, judged across multiple indicators, the VPLasso-Logistic model achieves good sparsity of variables both within groups and between groups.
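
The group structure identification step described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the distance 1 - |correlation|, the average linkage rule, the simulated data, and the function names `correlation_distance` and `average_linkage_groups` are all assumptions made here, and the number of groups is fixed in advance rather than chosen with the adjusted Rand index.

```python
import numpy as np


def correlation_distance(X):
    """Distance between columns of X, assumed here to be 1 - |correlation|."""
    corr = np.corrcoef(X, rowvar=False)
    return 1.0 - np.abs(corr)


def average_linkage_groups(dist, n_groups):
    """Agglomerative (hierarchical) clustering with average linkage.

    Repeatedly merges the two clusters whose members have the smallest
    mean pairwise distance until n_groups clusters remain. Returns one
    integer group label per variable.
    """
    clusters = [[j] for j in range(dist.shape[0])]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a is unaffected
    labels = np.empty(dist.shape[0], dtype=int)
    for g, members in enumerate(clusters):
        labels[members] = g
    return labels


# Hypothetical simulated data: 3 latent factors, each driving 4 correlated variables.
rng = np.random.default_rng(0)
n, n_groups, per_group = 200, 3, 4
factors = rng.normal(size=(n, n_groups))
noise = 0.3 * rng.normal(size=(n, n_groups * per_group))
X = np.repeat(factors, per_group, axis=1) + noise

labels = average_linkage_groups(correlation_distance(X), n_groups)
print(labels)
```

Variables driven by the same latent factor are strongly correlated, so their pairwise distances are small and they merge early. The resulting labels could then be passed as the group structure to a group-penalized model.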
Keywords/Search Tags: sparse model, group structure, variable clustering, principal component lasso