Comparison Of Statistical Inference Methods Based On Lasso For High Dimensional Data In Linear Regression Models

Posted on:2016-07-13

Degree:Master

Type:Thesis

Country:China

Candidate:J Q Zhao

GTID:2284330479492980

Subject:Epidemiology and Health Statistics

Abstract/Summary:

Objective: This thesis introduces five Lasso-based statistical inference methods for high-dimensional data in linear regression models: the Lasso Penalized Score Test (Lassoscore), Multiple Sample-Splitting (MS-split), Stability Selection, the Low-Dimensional Projection Estimate (LDPE), and the Covariance Test (Covtest). It compares the five methods and analyzes their performance in different high-dimensional cases.

Methods: The fundamental theory of each of the five methods (Lassoscore, MS-split, Stability Selection, LDPE, and Covtest) is introduced. Simulated data are generated according to four design parameters: seven sample sizes, n = 50, 75, 100, 150, 200, 300, 400; two numbers of variables, p = 100, 300; two correlation structures among the variables, either mutually independent or corr(Xi, Xj) = 0.5^|i-j|; and two sets of regression coefficients, either β1 = β2 = β3 = β4 = β5 = 5 with βj = 0 for j > 5, or β1 = β2 = β3 = β4 = β5 = 0.15 with βj = 0 for j > 5. Combinations of these four parameters define the different high-dimensional cases. The data are simulated and the five methods are applied to assess statistical significance in R. Expected false positives (EFP) and power serve as the evaluation indices for comparing the five methods across these cases.

Results: All five methods except Covtest perform well in the ideal high-dimensional cases. Stability Selection performs best of the five: its EFP is the lowest and its power is the highest. LDPE, Stability Selection, and MS-split all require the βmin condition. Among them, Stability Selection depends heavily on the βmin condition, and its power drops sharply in complex high-dimensional data. In the complex high-dimensional case, LDPE is conservative whether the sample size is large or small; at medium sample sizes its power is high, but it produces an extremely high number of false positives. Covtest gives conservative results in every high-dimensional case. In the complex cases, Lassoscore has the highest power of the five methods, followed by MS-split; however, Lassoscore also has the highest EFP, whereas the EFP of MS-split is close to 0.

Conclusions: In the common complex cases of high-dimensional data, Lassoscore is better than the other four methods at discovering the truly non-zero variables, and it depends only weakly on the βmin condition, but its expected false positive rate is high. The ability of MS-split to discover the truly non-zero variables depends on whether the data satisfy the βmin condition, and it is second only to Lassoscore when the condition is not satisfied. The expected false positive rate of MS-split is very low. In summary, Lassoscore and MS-split are the better Lasso-based inference methods for high-dimensional linear regression in common complex high-dimensional data. Of the two, Lassoscore is the more liberal and MS-split the more conservative. Although it is unknown in practice whether the βmin condition is satisfied, a suitable inference method can be selected according to actual needs.
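
The simulation design and the two evaluation indices described in the abstract can be sketched in a few lines. The thesis itself used R; the block below is only an illustrative Python sketch under stated assumptions, not the author's code. It reads the stated correlation as 0.5 raised to the power |i-j|, and it assumes standard-normal predictors, unit-variance Gaussian noise, EFP defined as the average number of falsely selected variables per replicate, and power defined as the average proportion of truly non-zero variables selected. None of these details are given in the abstract, and all function names are hypothetical.

```python
import math
import random


def simulate_design(n, p, rho, rng):
    """Draw n rows of p standard-normal predictors with
    corr(X_i, X_j) = rho ** |i - j| (rho = 0 gives independence).

    Uses the AR(1) recursion X_j = rho * X_{j-1} + sqrt(1 - rho^2) * z_j,
    which yields exactly this correlation structure.
    """
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def make_beta(p, signal, n_active=5):
    """First n_active coefficients equal `signal` (5 or 0.15 in the abstract),
    the remaining ones are zero."""
    return [signal] * n_active + [0.0] * (p - n_active)


def simulate_response(X, beta, rng, sigma=1.0):
    """y = X beta + noise; sigma = 1 is an assumption, not from the abstract."""
    return [
        sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
        for row in X
    ]


def efp_and_power(selections, beta):
    """selections: one set of selected (0-based) variable indices per replicate.

    Returns (average false positives per replicate,
             average fraction of truly non-zero variables selected).
    """
    active = {j for j, b in enumerate(beta) if b != 0}
    n_rep = len(selections)
    efp = sum(len(s - active) for s in selections) / n_rep
    power = sum(len(s & active) for s in selections) / (n_rep * len(active))
    return efp, power
```

As a usage illustration, with `beta = make_beta(100, 0.15)` and two replicates whose selected sets are `{0, 1, 2, 7}` and `{0, 1, 2, 3, 4}`, `efp_and_power` returns an EFP of 0.5 (one false positive across two replicates) and a power of 0.8 (eight of ten possible true discoveries). In a full study, each selected set would come from applying one of the five inference methods to a simulated data set and keeping the variables declared significant.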

Keywords/Search Tags:

high dimensional data, Lasso, statistical inference, linear regression

Related items

1	Research On Statistical Methods And Application For Detecting The Group Difference Between Networks In Systems Epidemiology
2	Statistical Analysis Of Varying-coefficient Partially Functional Linear Quantile Regression Models Based On Laryngoscope Image And Censored Data
3	Research On Prediction Method Of Survival Time Of Lung Cancer Patients Based On Causal Inference
4	Assessing Alcohol Consumption Effect On Epithelial Ovarian Cancer Mediated By DNA Methylation Based On A High-dimensional Causal Inference Test
5	Variables Screening Of Ultra-high Dimensional Single-index Model With Missing Data
6	Statistical Methods For Interaction Analysis In High Dimensional Data And The Application In Genome-wide Association Study Of Lung Cancer
7	Statistical Analysis Of Comprehensive HIV/AIDS Knowledge And Acceptance Attitude Among Young Women In Kenya And Lesotho
8	Research On Feature Selection Method For Chinese Medicine Metabolomics Data Based On Lasso
9	Several Statistical Issues In The Analysis Of High-dimensional Biological Data
10	Bayesian Multi-Locus Model In High Dimensional Omics Data