| Compositional data are commonly defined as multivariate observations that carry relative information, consisting of the proportions or percentages of the parts that make up a whole. High-dimensional compositional data are abundant in microbiology, particularly in studies of the intestinal microbiota. Compositional data are a special data type in statistics: because they are subject to non-negativity and fixed-sum constraints, traditional statistical methods for general high-dimensional data cannot be applied to them directly. How to identify the important predictors in high-dimensional microbial compositional data quickly and effectively is therefore a problem worth in-depth study. The distance between each pair of microbial samples plays an important role in the statistical analysis of microbiome data, and the results of an analysis depend on the choice of distance measure. This thesis therefore takes high-dimensional compositional data as its research object and constructs a variable selection method for such data based on partial distance correlation; the research has both theoretical and practical value. Drawing on the basic idea of DC-SIS, the distance-correlation-based variable selection method proposed by Li et al. (2012), the thesis proposes a new variable selection method for high-dimensional compositional data based on partial distance correlation, PDC-PSIS(Aitchison). Numerical simulations are used to assess the finite-sample performance of PDC-PSIS(Aitchison) and its screening performance for grouped predictors. An empirical analysis is then carried out on the gut microbiota abundance data of 250 adult twin volunteers deposited at the European Bioinformatics Institute by Xie et al. (2016), to explore how effective PDC-PSIS(Aitchison) is at selecting variables from high-dimensional compositional data. The
specific research content of this thesis includes: (1) Considering the particular nature of compositional data, a non-iterative PDC-SIS(Aitchison) method is proposed. It is based on partial distance correlation and uses the metric invariance of the centered log-ratio transformation to remove the fixed-sum constraint of compositional data. On this basis, a subcomposition containing a two-dimensional vector replaces the one-dimensional vector as the initial conditioning vector; the dimension of the conditioning subcomposition is then increased one component at a time, and the procedure iterates to select variables from the high-dimensional compositional data. This yields the PDC-PSIS(Aitchison) method. (2) Numerical simulations compare the untransformed PDC-SIS(crude) and PDC-PSIS(crude) methods, which treat high-dimensional compositional data as ordinary data and apply variable selection based on partial distance correlation directly without any transformation, with the transformed PDC-SIS(Aitchison) and PDC-PSIS(Aitchison) methods, and assess the effectiveness and scope of application of the latter two. (3) The PDC-SIS(crude), PDC-PSIS(crude), PDC-SIS(Aitchison), and PDC-PSIS(Aitchison) methods are applied in an empirical analysis of the abundance data of 250 adult twins reported by Xie et al. (2016), and the results are compared with existing research findings and variable selection methods to evaluate the effectiveness and accuracy of the proposed method. The research results of this thesis are as follows: (1) Because the PDC-SIS(crude) and PDC-PSIS(crude) methods ignore the characteristics of compositional data, these two variable selection methods can hardly identify the important predictors related to the response variable in the model. Because PDC-PSIS(Aitchison) takes into account that compositional data carry relative rather than absolute information, it can effectively select important predictors in both linear
and nonlinear models, and it is also applicable, to some extent, to grouped predictors. When the two random variables are independent, PDC-PSIS(Aitchison) screens important predictors better than PDC-SIS(Aitchison). (2) The variable selection performance of the PDC-SIS(Aitchison) method is mainly affected by the correlation between the components of the compositional data, so this method is suited to compositional data sets with stronger correlation among components. (3) In the comparison of variable selection methods for compositional data, when the dimension is held fixed, the variable selection performance of PDC-PSIS(Aitchison) improves as the correlation within the compositional data increases, and it outperforms PDC-SIS(Aitchison) and DC-SIS(Aitchison); when the correlation is held fixed, PDC-PSIS(Aitchison) consistently outperforms PDC-SIS(Aitchison) and DC-SIS(Aitchison) in variable selection for compositional data of different dimensions. |
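
The role of the centered log-ratio transformation mentioned in the abstract can be illustrated with a minimal Python sketch. The function names `clr` and `aitchison_distance_matrix`, the `pseudocount` parameter, and the toy abundance table are assumptions introduced here for illustration; they are not taken from the thesis. The sketch shows that two samples with the same proportions but different totals end up at Aitchison distance zero, which is the sense in which the transformation works with relative rather than absolute information.

```python
import numpy as np

def clr(X, pseudocount=0.0):
    """Centered log-ratio transform; each row of X is one composition."""
    X = np.asarray(X, dtype=float) + pseudocount
    if np.any(X <= 0):
        raise ValueError("clr needs strictly positive parts; set a pseudocount")
    logX = np.log(X)
    # Subtract the row-wise mean of the logs (log of the geometric mean).
    return logX - logX.mean(axis=1, keepdims=True)

def aitchison_distance_matrix(X, pseudocount=0.0):
    """Pairwise Aitchison distances: Euclidean distances between clr rows."""
    Z = clr(X, pseudocount)
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical abundance table: rows are samples, columns are taxa.
counts = np.array([[10.0, 30.0, 60.0],
                   [5.0, 15.0, 30.0],    # same proportions as row 0
                   [40.0, 40.0, 20.0]])
D = aitchison_distance_matrix(counts)
print(np.round(D, 3))
```

Real microbiome tables usually contain zeros, for which the logarithm is undefined; the `pseudocount` argument is one simple workaround and is not necessarily the treatment used in the thesis.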
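
As a rough illustration of screening by partial distance correlation, the sketch below ranks predictors by their sample partial distance correlation with a response, given a conditioning set. It uses the U-centered, projection-based sample statistic of Székely and Rizzo, which is assumed here to be the relevant definition. It is not the thesis's PDC-PSIS(Aitchison) procedure: the iterative growth of the conditioning subcomposition is omitted, and all function names, the simulated data, and the choice of conditioning column are hypothetical.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance matrix between the n observations in x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def u_center(D):
    """U-centering of a symmetric distance matrix (needs n >= 4)."""
    n = D.shape[0]
    if n < 4:
        raise ValueError("U-centering needs at least 4 observations")
    row = D.sum(axis=1, keepdims=True)
    col = D.sum(axis=0, keepdims=True)
    U = D - row / (n - 2) - col / (n - 2) + D.sum() / ((n - 1) * (n - 2))
    np.fill_diagonal(U, 0.0)
    return U

def u_inner(A, B):
    """Inner product of U-centered matrices (their diagonals are zero)."""
    n = A.shape[0]
    return (A * B).sum() / (n * (n - 3))

def partial_dcor(x, y, z):
    """Sample partial distance correlation of x and y given z."""
    A, B, C = (u_center(pairwise_dist(v)) for v in (x, y, z))
    cc = u_inner(C, C)
    if cc > 0:
        # Project out the component of A and B that lies along C.
        A = A - (u_inner(A, C) / cc) * C
        B = B - (u_inner(B, C) / cc) * C
    denom = np.sqrt(u_inner(A, A) * u_inner(B, B))
    return float(u_inner(A, B) / denom) if denom > 0 else 0.0

def rank_by_partial_dcor(Z, y, cond_idx):
    """Rank the columns of Z outside cond_idx by pdcor with y given Z[:, cond_idx]."""
    cond_idx = list(cond_idx)
    scores = {j: partial_dcor(Z[:, j], y, Z[:, cond_idx])
              for j in range(Z.shape[1]) if j not in cond_idx}
    return sorted(scores, key=scores.get, reverse=True), scores

# Simulated example: the response depends nonlinearly on column 2 only.
rng = np.random.default_rng(0)
Z = rng.normal(size=(60, 6))
y = Z[:, 2] ** 2 + 0.1 * rng.normal(size=60)
ranking, scores = rank_by_partial_dcor(Z, y, cond_idx=[0])
print(ranking)
```

For compositional predictors, `Z` would hold log-ratio-transformed components rather than raw proportions; the simulated Gaussian columns here only keep the example short.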