Objective: This paper introduces five Lasso-based statistical inference methods for high-dimensional data in linear regression models: the Lasso Penalized Score Test (Lassoscore), Multiple Sample-Splitting (MS-Split), Stability Selection, the Low-Dimensional Projection Estimate (LDPE), and the Covariance Test (Covtest). It compares the five methods and analyzes their performance in different high-dimensional cases.
Methods: The fundamental theory of each of the five methods (Lassoscore, MS-Split, Stability Selection, LDPE, and Covtest) is introduced in turn. Simulated data are defined by four parameters: seven sample sizes, n = 50, 75, 100, 150, 200, 300, 400; two numbers of variables, p = 100, 300; two correlation structures among the variables, either mutually independent or corr(Xi, Xj) = 0.5^|i-j|; and two sets of regression coefficients, either β1 = β2 = β3 = β4 = β5 = 5 with βj = 0 for j > 5, or β1 = β2 = β3 = β4 = β5 = 0.15 with βj = 0 for j > 5. Combinations of these four parameters define the different high-dimensional cases. The data are simulated and the five methods are applied to test statistical significance in R. Finally, expected false positives (EFP) and power are used as evaluation indices to compare the performance of the five methods across the different cases.
Results: In the ideal high-dimensional cases, all of the methods except Covtest perform well. Stability Selection performs best of the five, with the lowest EFP and the highest power. LDPE, Stability Selection, and MS-Split all require the βmin condition. Among them, Stability Selection depends most heavily on the βmin condition, and its power drops sharply in complex high-dimensional data. In the complex high-dimensional cases, LDPE is conservative when the sample size is either small or large; at medium sample sizes its power is high, but it introduces an extremely high number of false positives. Covtest gives conservative inference results in every high-dimensional case considered. In the complex cases, Lassoscore has the highest power of the five methods, followed by MS-Split; however, Lassoscore also has the highest EFP, whereas the EFP of MS-Split is close to 0.
Conclusions: In the common complex cases of high-dimensional data, Lassoscore is better than the other four methods at discovering the truly non-zero variables, and its requirement on the βmin condition is weak, but its expected false positive rate is high. The ability of MS-Split to discover the truly non-zero variables depends on whether the data satisfy the βmin condition; when the condition is not satisfied, it is second only to Lassoscore. The expected false positive rate of MS-Split is very low. In summary, Lassoscore and MS-Split are the better Lasso-based inference methods for high-dimensional linear regression in common complex high-dimensional data; of the two, the former is relatively liberal and the latter conservative. Although in practical applications it is unknown whether the βmin condition is satisfied, a suitable inference method can be selected according to the actual need.
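The simulation design and the two evaluation indices described in the Methods can be sketched as follows. This is a minimal illustration in standard-library Python, not the R code used in the paper. The function names (`simulate_design`, `make_beta`, `simulate_response`, `efp_and_power`) are hypothetical. The standard-normal predictors and the N(0, 1) noise are assumptions, since the abstract does not state them. EFP is taken here as the average number of falsely selected variables per replicate, and power as the average fraction of the five truly non-zero variables that are selected; both definitions are also assumptions.

```python
import math
import random


def simulate_design(n, p, rho, rng):
    """Draw n rows of p standard-normal predictors with corr(Xi, Xj) = rho**|i-j|.

    rho = 0 gives mutually independent columns. An AR(1) recursion across
    columns produces exactly this Toeplitz correlation structure.
    """
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def make_beta(p, signal, n_signal=5):
    """First n_signal coefficients equal `signal` (5 or 0.15), the rest are 0."""
    return [signal] * n_signal + [0.0] * (p - n_signal)


def simulate_response(X, beta, rng, sigma=1.0):
    """y = X beta + noise. The N(0, sigma^2) noise is an assumption."""
    return [
        sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
        for row in X
    ]


def efp_and_power(selections, true_support):
    """Average false positives per replicate, and average share of true signals found.

    selections: list of sets of selected variable indices, one set per replicate.
    true_support: set of indices of the truly non-zero coefficients.
    """
    n_rep = len(selections)
    false_pos = sum(len(sel - true_support) for sel in selections)
    true_pos = sum(len(sel & true_support) for sel in selections)
    return false_pos / n_rep, true_pos / (n_rep * len(true_support))


# One "complex" case from the design: small n, correlated predictors, weak signal.
rng = random.Random(0)
X = simulate_design(n=50, p=100, rho=0.5, rng=rng)
beta = make_beta(p=100, signal=0.15)
y = simulate_response(X, beta, rng)

# Selections would come from an inference method; these two sets are made up
# purely to show how the indices are computed.
efp, power = efp_and_power([{0, 1, 7}, {0, 1, 2, 3, 4}], true_support={0, 1, 2, 3, 4})
```

Looping this over the 7 × 2 × 2 × 2 grid of (n, p, correlation, coefficient) settings and feeding each method's selected variables into `efp_and_power` would give the kind of comparison the abstract reports.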