| Variable selection methods are widely used for dimensionality reduction, but data produced by modern technologies, microarrays in particular, far exceed traditional scales, and high-dimensional data sets are now common. To complicate matters further, many variables in biological data are dependent: pairwise correlations can be very high among genes that share a common biological function or take part in the same metabolic pathway. Among traditional approaches, single-variable selection methods select variables one at a time and therefore miss important group effects, while group variable selection methods account for strong correlation between variables but mainly apply to data that are already grouped. To address the case where the correlation structure between variables is unknown and small-sample, high-dimensional data make analysis difficult, this paper proposes a group variable selection algorithm based on variable clustering (clv-pc Lasso). The algorithm achieves dimensionality reduction in two steps, group structure identification and variable selection. The main contents are as follows. First, group structure identification. Variable selection for strongly correlated data usually relies on Lasso, Elastic net, and similar algorithms, but these often pick, more or less at random, only one or a few variables out of several strongly correlated ones. For this reason, clustering of variables around latent variables (the clv algorithm) is introduced to handle the grouping of highly correlated data. The main idea is to pool the response variable Y with all explanatory variables X and cluster them layer by layer. At each layer, the clustering result is determined by maximizing the within-group correlation of the variable groups. Once the layer-wise clustering results are obtained, the loss of local identity of the variable group is controlled with respect to the response variable, the optimal aggregation level is determined, and the layer clustering result that meets the requirements is selected. This yields the group structure among the explanatory variables related to Y. Second, variable selection. The principal component Lasso (pc Lasso) algorithm is applied to the grouped data to select variable groups and individual variables at the same time. Starting from the known group structure, the principal component Lasso model achieves both within-group and between-group sparsity through its adjustment function, and the variables related to Y are selected, which achieves the dimensionality reduction. Finally, case analysis. In Chapter 5, the algorithm is applied to gene expression profile data to assess its practicality. The process has four steps. In the first step, gene scores are computed under three gene scoring criteria and the genes are initially screened by score. In the second step, the genes are integrated and the group structure is identified, with genes grouped according to their correlation. In the third step, variable groups and variables are selected, and cross-validation results under different thresholds are compared to determine the more suitable variables. In the fourth step, the clv-pc Lasso algorithm is compared with the Elastic net and sparse group Lasso algorithms: it has the smallest cross-validation error, shows good prediction and generalization ability, performs better in variable selection, and is more practical. |
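The idea behind the group structure identification step, placing strongly correlated variables in the same group before any selection takes place, can be illustrated with a minimal sketch. This is not the clv algorithm itself: it substitutes a plain average-linkage agglomeration on absolute Pearson correlation, leaves out the response variable Y, and uses a hypothetical cutoff `threshold` in place of the optimal aggregation level. The function names and the toy data are assumptions made for illustration only.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_by_correlation(columns, threshold=0.8):
    """Greedy agglomerative grouping of variables (illustrative stand-in, not clv).

    columns: dict mapping a variable name to its list of observations.
    threshold: hypothetical cutoff; merging stops once the best average
               absolute correlation between two groups falls below it.
    """
    groups = [[name] for name in columns]

    def linkage(g1, g2):
        # Average absolute correlation over all cross-group pairs.
        pairs = [(a, b) for a in g1 for b in g2]
        total = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs)
        return total / len(pairs)

    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                score = linkage(groups[i], groups[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # merge group j into group i
        del groups[j]
    return groups


# Toy data (made up): x1 and x2 move together, as do x3 and x4.
toy = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "x3": [1, -1, 1, -1, 1, -1],
    "x4": [1.2, -0.9, 1.1, -1.0, 0.8, -1.1],
}
groups = group_by_correlation(toy, threshold=0.8)
```

On this toy input the two strongly correlated pairs end up in separate groups, which is the kind of group structure that a grouped selection method such as pc Lasso would then take as given.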