Ultra-high Dimensional Feature Selection And Mean Estimation Under Random Missing Mechanism

Posted on: 2022-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: W H Li
GTID: 2480306521453514
Subject: Master of Applied Statistics
Abstract/Summary:
With breakthroughs in computing and storage technology, researchers in financial markets, genetics, structural chemistry and other fields can easily obtain ultra-high dimensional data, and the development and application of methods for analyzing such data has attracted growing attention. The main difficulties are high dimensionality and missing data. Because of computational cost, statistical accuracy and algorithmic stability, mature high-dimensional statistical methods cannot be applied directly to ultra-high dimensional data. Making effective statistical inference when the covariate dimension is very high and the response variable is missing is therefore an important research topic. Building on existing two-stage ultra-high dimensional feature screening, this thesis improves the variable importance measure used in the first stage and proposes a model-free random forest sure independence screening method (rf-sis). The rf-sis method is used to reduce the dimension of ultra-high dimensional data; on this basis, inverse probability weighting (IPW) and regression imputation (MI) are used to estimate the mean under the missing-at-random mechanism.

Firstly, rf-sis replaces the importance measure in the first stage of SIS screening, the marginal correlation coefficient, with the change in mean squared error after permuting the out-of-bag samples in random forest regression. The covariates are sorted in descending order of this change in mean squared error and then truncated to obtain a high-dimensional candidate variable set. In the second stage, Lasso is applied to this candidate set to obtain the final variable set. To test the screening performance of rf-sis, five groups of simulation experiments, graded by model complexity, are designed on the basis of the SIS screening method. The numerical simulations show that rf-sis can correctly identify the true variables in ultra-high dimensional models with high complexity.

Secondly, the dimension of the variable set obtained by rf-sis may still be high. To improve the signal-to-noise ratio, a sufficient dimension reduction method is used to compress the variables. On this basis, IPW weights the observed data and MI imputes the missing responses by regression, and the mean estimate is computed from the processed samples. To test the estimation performance of IPW and MI, six groups of simulation experiments are designed under different parameter settings. The simulation results show that correlation among the covariates has limited influence on the estimates. When the missing proportion of the response variable is below 60%, IPW and MI give an effective estimate of the mean; when the missing proportion is too high, IPW and MI lose their good properties, which agrees with previous findings that most imputation methods fail when the missing proportion is too high.

Finally, to investigate the performance of rf-sis and the mean estimators on real data, rf-sis is used to screen for genes related to the tumor suppressor gene TP53 in the ovarian cancer gene array data of the TCGA database. The results show that some of the genes screened by rf-sis are consistent with previous studies. In the mean estimation part, IPW and MI are used to estimate the mean survival time of patients. The results show little difference between the two estimates, but the MI confidence interval is shorter and its estimate more precise, which effectively mitigates the large estimation bias of the direct deletion method.
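
The mean-estimation step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the simulated data, the single covariate `x` (standing in for the reduced covariate after screening and dimension reduction), the logistic propensity model, the linear outcome model and all function names are assumptions made for illustration only. The sketch compares the direct deletion (complete-case) mean with an IPW estimate and a regression imputation estimate when the response is missing with a probability that depends on `x`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, r, n_iter=25):
    """Fit P(r = 1 | x) = sigmoid(a + b * x) by Newton-Raphson (assumed propensity model)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += ri - p              # gradient w.r.t. intercept
            g1 += (ri - p) * xi       # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def fit_ols(x, y):
    """Simple least squares of y on x; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n, seed):
    """Hypothetical data: y = 2 + 3x + noise, observed with probability sigmoid(0.5 + x)."""
    rng = random.Random(seed)
    x, y, r = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        yi = 2.0 + 3.0 * xi + rng.gauss(0.0, 1.0)
        ri = 1 if rng.random() < sigmoid(0.5 + xi) else 0
        x.append(xi)
        y.append(yi if ri == 1 else None)  # missing responses are hidden
        r.append(ri)
    return x, y, r


def estimate_means(x, y, r):
    obs = [i for i, ri in enumerate(r) if ri == 1]
    # Direct deletion: average the observed responses only.
    cc = sum(y[i] for i in obs) / len(obs)
    # IPW: weight each observed response by 1 / estimated observation probability.
    a, b = fit_logistic(x, r)
    weights = {i: 1.0 / sigmoid(a + b * x[i]) for i in obs}
    ipw = sum(weights[i] * y[i] for i in obs) / sum(weights.values())
    # Regression imputation: fill each missing response with its fitted value.
    c0, c1 = fit_ols([x[i] for i in obs], [y[i] for i in obs])
    filled = [y[i] if r[i] == 1 else c0 + c1 * x[i] for i in range(len(x))]
    reg = sum(filled) / len(filled)
    return cc, ipw, reg


x, y, r = simulate(5000, seed=0)
cc_mean, ipw_mean, reg_mean = estimate_means(x, y, r)
print("true mean: 2.0")
print(f"direct deletion: {cc_mean:.3f}  IPW: {ipw_mean:.3f}  regression imputation: {reg_mean:.3f}")
```

In this assumed setup larger values of `x` are both more likely to be observed and associated with larger responses, so the direct deletion mean is biased upward, while the weighted and imputed estimates correct for the selection through the covariate.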
Keywords/Search Tags: Ultra-high dimensional data, Missing data, Sure independence screening, Mean estimation