
Studies On Balanced Estimation And Adaptive Projection Inference In High Dimensional Statistical Learning

Posted on: 2022-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 1480306314955219
Subject: Statistics
Abstract/Summary:
With the rapid development of science and technology, high-dimensional statistical learning has become increasingly common and important in many fields of science, engineering, and the humanities, ranging from molecular biology and health sciences to economics, finance, and artificial intelligence. Here, high dimension means that the dimension of the unknown parameters is much larger than the sample size, and learning refers to revealing the hidden information behind the data. The three core issues in high-dimensional statistics are variable selection, estimation, and inference. In recent years, thanks to the sustained efforts of researchers and practitioners in statistics, high-dimensional methods have emerged one after another and gradually matured. However, the existing literature generally assumes that the data are clean, that is, not affected by measurement errors. In fact, measurement errors are common in high-dimensional data, such as sensor network data, high-throughput sequencing data, and gene expression data. Naively applying methods designed for clean data sets to corrupted data results in inconsistent and unstable estimates, and thus leads to incorrect conclusions. In addition, although de-biased methods have greatly advanced statistical inference, inference via de-biased estimators typically requires a large sample size to guarantee asymptotic normality and allows only a relatively small number of nonzero coefficients above the identifiable level. Therefore, the de-biased method may not perform well in some practical applications. This dissertation addresses these two issues.

For variable selection and estimation in the measurement error model, we propose a balanced estimator, where the balance refers to the trade-off between prediction, variable selection, and computational efficiency. It combines the strengths of the nearest positive semi-definite projection and the combined L1 and concave regularization, and thus can be solved efficiently through a coordinate optimization algorithm. We also provide theoretical guarantees for the proposed methodology by establishing oracle prediction and estimation error bounds equivalent to those for the Lasso with a clean data set, as well as an explicit and asymptotically vanishing bound on the false sign rate, which controls overfitting, a serious problem under measurement errors. Because the method involves a non-convex optimization problem, we also establish theoretical guarantees for the appealing properties of the computable solutions. Our numerical studies show that improved variable selection in turn improves prediction and estimation performance under measurement errors.

To relax the constraints of the de-biased methods (on the sample size and on the number of nonzero coefficients above the identifiable level) and to improve the efficiency of inference, we develop a new inference procedure via an adaptive projection estimator, which is based on an adaptive orthogonalization vector. This orthogonalization vector is adaptive in that it is orthogonal to the covariate vectors corresponding to the identifiable coefficients, while being only a relaxed orthogonalization against the remaining unidentifiable covariates. In this way, it completely removes the impact of the identifiable coefficients and keeps that of the unidentifiable ones at a negligible level, yielding much weaker constraints on both the sample size and the number of nonzero coefficients. In addition, we provide a stable version of the method and extend it to generalized linear models (GLMs). In theory, we rigorously prove the asymptotic normality of the adaptive projection estimator. Moreover, extensive simulations further demonstrate the superiority of the proposed method.

Finally, we analyze the diabetes data and the stock data. The analysis of the diabetes data shows that body mass index has the strongest positive correlation with the progression of diabetes. The analysis of the stock data shows that the four companies GAPTQ, GCO, HAR and OMS all have strong influence in their respective fields.
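The role of the nearest positive semi-definite projection can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: an additive measurement error model W = X + A with known error variance tau^2, a bias-corrected surrogate Gram matrix W'W/n - tau^2 I, and a projection in the Frobenius norm done by clipping negative eigenvalues. The dissertation may use a different error model or a different projection norm; the names `nearest_psd`, `tau`, `sigma_hat`, and `sigma_psd` are hypothetical.

```python
import numpy as np


def nearest_psd(mat, eps=0.0):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone.

    Assumption: eigenvalue clipping is used here for illustration only;
    it is not claimed to be the projection used in the dissertation.
    """
    sym = (mat + mat.T) / 2.0                 # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(sym)    # eigendecomposition of a symmetric matrix
    clipped = np.clip(eigvals, eps, None)     # raise eigenvalues below eps up to eps
    return (eigvecs * clipped) @ eigvecs.T    # rebuild V diag(clipped) V^T


rng = np.random.default_rng(0)
n, p = 20, 50                                 # high-dimensional setting: p > n
tau = 0.5                                     # assumed known measurement error std

X = rng.standard_normal((n, p))               # clean (unobserved) covariates
W = X + tau * rng.standard_normal((n, p))     # observed, corrupted covariates

# Bias-corrected surrogate for X'X/n; indefinite when p > n because
# W'W/n has rank at most n and tau^2 is subtracted from every eigenvalue.
sigma_hat = W.T @ W / n - tau**2 * np.eye(p)
sigma_psd = nearest_psd(sigma_hat)

min_eig_before = np.linalg.eigvalsh(sigma_hat).min()
min_eig_after = np.linalg.eigvalsh(sigma_psd).min()
```

Under these assumptions, `min_eig_before` is negative, so a Lasso-type objective built on `sigma_hat` would be non-convex and unbounded below, whereas `sigma_psd` restores a positive semi-definite quadratic form that a coordinate optimization algorithm can work with.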
Keywords/Search Tags:High-dimensional statistical learning, Variable selection, Estimation, Inference, Measurement errors, Nearest positive semi-definite projection, Adaptive projection, Bias correction