
Structure Identification, Variable Selection And Robust Estimation For Some Semiparametric Models With High Dimensional Complicated Data

Posted on: 2017-02-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K N Wang
Full Text: PDF
GTID: 1220330485979593
Subject: Probability theory and mathematical statistics
Abstract/Summary:
In many areas of modern scientific research, such as medicine, agriculture, social surveys, economics, biology and epidemiology, complicated data such as longitudinal data and missing data are frequently encountered. With the improvement of data collection capacity, the reduction of its cost, and the rapid development of data storage technology, the dimension of the data grows ever larger. Semiparametric models, which overcome the "curse of dimensionality" and reduce the risk of model misspecification, have therefore been widely studied and applied.

In Chapter 2, we study simultaneous partial linear structure identification and variable selection for partial linear varying coefficient models with longitudinal data. These models are often used for analysing longitudinal data because they strike a good balance between flexibility and parsimony. Existing statistical inference methods for this model are mainly built upon the following assumption: the subset of variables having constant or varying effects on the response is known in advance. That is, one assumes the original covariate can be partitioned into two subsets, xij = (xij(1)T, xij(2)T)T, where xij(1) has varying effects and xij(2) has constant effects, and all statistical inference is then built upon the model

  yij = xij(1)T α(tij) + xij(2)T β + εij.

In fact, the partial linear structure assumption is fundamentally important, since the validity of the fitted model and its inference depend heavily on whether the partial linear structure is specified correctly. In applications, however, it is unreasonable to determine artificially which subset of variables has constant or varying effects on the response. Furthermore, the set of relevant variables, and the type of effect each relevant variable has on the response, may differ between the mean and different percentiles of the distribution.
For instance, in Section 2.4, the analysis of a longitudinal AIDS data set shows that the effect of PreCD4 is time-varying at the lower quantiles, whereas it tends to be constant at the upper quantiles and at the mean. What is more, the data may be contaminated by outliers, so robustness is necessary. Note that the semiparametric partial linear varying coefficient model with underlying true structure can be written as

  yij = Σk=1..p xijk αk(tij) + εij,

where αk(·) is a genuinely varying function for k ∈ AV, a nonzero constant for k ∈ AC, and the zero function O(·) for k ∈ AZ; the unknown index sets AV, AC and AZ for varying effects, nonzero constant effects and zero effects, respectively, are mutually exclusive and satisfy AV∪AC∪AZ = {1,…,p}.

Based on the idea of penalization-type variable selection methods and a general M-type loss function, which treats mean, median, quantile and robust mean regression in a unified framework, this chapter proposes a penalized M-type regression that performs simultaneous estimation of the nonzero coefficients and three types of selection: selection between varying and constant effects, and selection of the relevant variables (i.e., identification of the unknown sets AV, AC and AZ). It can be implemented easily in one step; by choosing different loss functions it can extract more information about the relationship between the response and the covariates, and it can be robust to outliers. Under mild regularity conditions, consistency in the three types of selection and the oracle property in estimation are established as well. Here, selection consistency means that the probability that the new method correctly identifies the varying coefficients, constant coefficients and relevant variables tends to one, i.e.,

  P(ÂV = AV, ÂC = AC, ÂZ = AZ) → 1,

where ÂV, ÂC and ÂZ are the estimators of AV, AC and AZ, respectively. The oracle property indicates that the resulting varying coefficient estimates attain the optimal convergence rate, and the constant coefficient estimates have the same asymptotic normal distribution as their counterparts obtained under the true model.
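The unified M-type loss can be made concrete with a small numerical sketch. The function below (the name, and the Huber constant c = 1.345, are illustrative and not from the dissertation) shows how least squares, quantile/median, and robust mean (Huber) regression all arise as different choices of the loss ρ(r):

```python
import numpy as np

def mtype_loss(r, kind="quantile", tau=0.5, c=1.345):
    """Evaluate a few common M-type losses rho(r) on residuals r."""
    r = np.asarray(r, dtype=float)
    if kind == "mean":                      # least squares: rho(r) = r^2
        return r ** 2
    if kind == "quantile":                  # check loss; tau = 0.5 gives the median
        return r * (tau - (r < 0))
    if kind == "huber":                     # robust mean: quadratic near 0, linear in the tails
        a = np.abs(r)
        return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)
    raise ValueError(f"unknown loss kind: {kind}")
```

Minimizing the sum of such losses over the regression coefficients, plus a penalty on the coefficient differences, is the shape of the penalized M-type regression described above.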
Simulation studies and real data analysis also confirm the merits of our method.

Chapter 3 considers robust and efficient direction identification for the following groupwise additive multiple-index models:

  Y = Σk=1..K gk(βkT Xk) + ε,

where gk(·) is an unknown component function, Y ∈ R is the response variable, X ∈ Rp is a p-dimensional predictor vector that can be naturally divided into K non-overlapping groups, X = (X1T,…,XKT)T with Xk ∈ Rpk, βk ∈ Rpk is the single-index vector of interest corresponding to Xk, and the random error ε is independent of X. This model was also considered in Wang et al. (2015). Clearly, when K = 1 it reduces to the classical single-index model, and if K = 2 and g1(u) = u it is the partially linear single-index model. When the functions gk(·) are unspecified, the index parameters βk are not identifiable; thus the direction of βk, rather than its true value, is of primary interest. Define the p × K block-diagonal matrix P whose kth column is (0p1×1T,…, βkT,…, 0pK×1T)T, where 0pk×1 is the pk×1 zero vector, k = 1,…,K. Obviously, for the above groupwise additive multiple-index models, Y and X are independent conditional on PTX. The column space of P is called the central dimension reduction subspace (Li 1991, Wang et al. 2015). Under the linearity condition that E(X | PTX) is a linear function of PTX, Wang et al. (2015) showed that the linear least squares solution βLS has the same direction as (β1T,…,βKT)T, i.e., βLS,k = φk βk for some φk ∈ R, k = 1,…,K. However, it is well known that the least squares solution is adversely influenced by outliers and heavy-tailed distributions. It is therefore worthwhile to remedy these weaknesses, and of great interest to see whether the robust and efficient composite quantile regression (CQR) based method (Zou and Yuan 2008, Kai et al. 2011) can be used in this setting. Interestingly, similar to the least squares solution βLS and without involving any nonparametric approach, we show that the simple linear CQR coefficient for Y|X offers a consistent and asymptotically normal estimate of the directions of all the index parameter vectors.
Specifically, for 0 < τ1 < τ2 < … < τq < 1, b = (b1,…,bq)T, η = (η1T,…,ηKT)T, ηk = (ηk1,…,ηkpk)T, k = 1,…,K, define the population version of the linear CQR loss as

  L(b,η) = Σk=1..q E[ρτk(Y − bk − ηTX)],  with ρτ(u) = u(τ − I(u < 0)),

where I(·) is the indicator function. Let (b*, η*) = arg minb,η L(b,η). Under the linearity condition, we first prove that η* = (η1*T,…,ηK*T)T lies in the column space of P, i.e., η* = Pκ for some κ = (κ1,…,κK)T ∈ RK. This implies that the directions of the K index vectors βk, k = 1,…,K, can be identified by simple linear CQR without employing any nonparametric techniques. Next, based on the sample {Xi, Yi}i=1..n, the sample version of L(b,η) is defined as

  Ln(b,η) = (1/n) Σi=1..n Σk=1..q ρτk(Yi − bk − ηTXi).

Let (b̂, η̂) = arg minb,η Ln(b,η); then η̂ = (η̂1T,…,η̂KT)T is the linear CQR estimator of η*. We further prove the asymptotic normality of the new estimator η̂ and obtain bootstrap approximations to its distribution. As a specific application, an iteration-free CQR estimation procedure for the partially linear single-index model is proposed and its asymptotic properties are established. Furthermore, for variable selection in sparse and high-dimensional settings, the approach can also be used to develop a penalized CQR of the form

  Qλ(b,η) = Ln(b,η) + Σj pλ(|ηj|),

where pλ(·) is a penalty function and λ is a nonnegative tuning parameter; we focus on two commonly used nonconvex penalties, the SCAD penalty (Fan and Li 2001) and the MCP (Zhang 2010). In the p ≫ n setting, the oracle property is established, i.e.,

  P(η° ∈ Bn(λ)) → 1,

where Bn(λ) is the set of local minima of the nonconvex penalized CQR objective function Qλ(b,η) with tuning parameter λ, and η° is the oracle estimator obtained under the true model. Here, the oracle property indicates that the resulting estimator is the oracle estimator itself, rather than merely mimicking its performance. The new method in this chapter inherits the robustness and efficiency advantages of the CQR approach.
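The penalized CQR objective can be sketched numerically. The snippet below evaluates the sample CQR loss with a shared slope across quantile levels, together with the SCAD penalty of Fan and Li (2001); the function names are ours, and no optimizer is included — this only evaluates the objective:

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta_j|), elementwise; a > 2 (a = 3.7 is the usual default)."""
    b = np.abs(np.asarray(beta, dtype=float))
    small = lam * b                                            # b <= lam: linear, like lasso
    mid = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))  # lam < b <= a*lam: quadratic taper
    big = lam ** 2 * (a + 1) / 2                               # b > a*lam: constant, no bias
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, big))

def cqr_loss(b, eta, X, Y, taus):
    """Sample CQR loss L_n(b, eta): average check loss over q quantile levels,
    with quantile-specific intercepts b_k and a common slope eta."""
    r = Y - X @ eta
    total = 0.0
    for k, tau in enumerate(taus):
        u = r - b[k]
        total += np.mean(u * (tau - (u < 0)))                  # check loss rho_tau(u)
    return total
```

Adding `scad_penalty(eta, lam).sum()` to `cqr_loss(...)` gives the nonconvex objective Qλ(b,η) whose local minima the oracle property concerns.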
Simulation results and real data analysis also confirm our method.

In Chapter 4, we consider the following d-dimensional estimating function:

  Q(θ, Y, X) = (Q1(θ, Y, X),…, Qd(θ, Y, X))τ,

where θ = (θ1,…,θp)τ is a p-dimensional parameter to be estimated, the Qk(θ, y, x), k = 1,…,d, are given functions that may be nonlinear with respect to θ, and τ denotes the transpose of a vector. Suppose the estimating function is conditionally unbiased, i.e., there is a unique solution θ0 such that E[Q(θ0, Y, X) | X] = 0 almost surely. When a conditional estimating function is nonlinear and the data are incomplete, we are faced with two difficulties: the non-identifiability of complete-case analysis and the inefficiency of nonparametric imputation. To tackle these issues, Chapter 4 defines a full-imputation smooth distance and then suggests a smooth minimum distance estimation for the parameter in the model. The method uniquely identifies the parameter in the nonlinear model, and the resulting estimator is always √n-consistent and asymptotically normal for a fixed, non-vanishing bandwidth as well as for a vanishing one, even though a kernel estimator is used in the intermediate procedure. Specifically, under some mild conditions, for arbitrary h0 > 0 the resulting estimator θ̂n,h is √n-consistent and asymptotically normal, where h and n are the bandwidth and sample size, respectively. Furthermore, √n(θ̂n,h − θ0) converges in distribution to a tight random process indexed by h whose marginal distributions are zero-mean normal distributions, uniformly for h ∈ Hn = {h0 ≥ h > 0 : nh^(4p/α) ≥ C}, where C > 0 and 0 < α < 1.
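The abstract does not spell out the full-imputation smooth distance, but the general shape of a kernel-smoothed minimum distance criterion for a conditional estimating function can be sketched as a V-statistic with a Gaussian kernel. This is an assumption about the construction for illustration, not the chapter's exact definition:

```python
import numpy as np

def smooth_distance(theta, Q, X, Y, h):
    """Kernel-smoothed distance for a conditional estimating function:
    D_n(theta) = (1/n^2) sum_i sum_j Q_i(theta)^T Q_j(theta) K_h(X_i - X_j),
    a sample analogue of || E[Q(theta,Y,X) | X] ||-type criteria."""
    n = len(Y)
    Qs = np.array([Q(theta, Y[i], X[i]) for i in range(n)])   # n x d matrix of Q values
    diff = X[:, None, :] - X[None, :, :]                      # pairwise differences X_i - X_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))    # Gaussian kernel weights
    return (Qs @ Qs.T * K).sum() / n ** 2
```

A minimizer of such a criterion over θ stays well defined for a fixed bandwidth h, which is consistent with the fixed-bandwidth asymptotics described above.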
As such, the new method is flexible when the model is nonlinear and the involved variables are multi-dimensional.

The method proposed in Chapter 2 has the following defects: first, it cannot be used when the response yij is discrete; second, although longitudinal data are involved, it does not incorporate the within-subject correlation structure, which causes a loss of estimation efficiency; third, although median and robust mean regressions are robust to outliers, they are limited in terms of estimation efficiency. Wang et al. (2013) proposed the exponential squared loss 1 − exp(−r²/h), with score function

  φh(r) = (2r/h) exp(−r²/h).

Note that φh(r) is a bounded score, since supr |φh(r)| = √(2/(he)) < ∞. Here, h > 0 controls the trade-off between robustness and efficiency of the estimator. In particular, for large h, 1 − exp(−r²/h) ≈ r²/h, so in this extreme case the proposed estimators behave like least squares estimators. For small h, large values of |r| do not exert a large impact on the estimators; hence a smaller h limits the influence of outliers. Wang et al. (2013) pointed out that φh(r) is more robust than other existing robust methods, e.g., Huber's estimate, quantile regression (Koenker and Bassett 1978), and composite quantile regression (Zou and Yuan 2008). In Chapter 5, we mainly focus on semiparametric generalized partial linear varying coefficient models with longitudinal data, whose underlying true structure takes the form

  E(yij | xij, tij) = g^(−1)( Σk=1..p xijk αk(tij) ),

where g^(−1)(·) is a given link function and, as in Chapter 2, αk(·) is genuinely varying for k ∈ AV, a nonzero constant for k ∈ AC, and identically zero for k ∈ AZ. By using the score φh(r) and the idea of generalized estimating equations, Chapter 5 proposes a new robust and efficient estimator for the generalized partial linear varying coefficient models with longitudinal data, which performs variable selection and partial linear structure identification simultaneously. What is more, it can overcome the defects mentioned above.
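The behaviour of the exponential squared loss described above is easy to verify numerically. The sketch below implements the loss and its score (the function names are ours):

```python
import numpy as np

def exp_squared_loss(r, h):
    """Exponential squared loss 1 - exp(-r^2 / h) of Wang et al. (2013)."""
    return 1.0 - np.exp(-np.asarray(r, dtype=float) ** 2 / h)

def exp_squared_score(r, h):
    """Score phi_h(r) = (2r/h) exp(-r^2/h); bounded, with sup |phi_h| = sqrt(2/(h*e))
    attained at r = sqrt(h/2)."""
    r = np.asarray(r, dtype=float)
    return (2.0 * r / h) * np.exp(-r ** 2 / h)
```

For large h the loss is close to r²/h (least-squares behaviour), while for small h the score of a huge residual is essentially zero, so an outlier barely moves the estimate — exactly the robustness/efficiency trade-off that h controls.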
The new method is built upon newly proposed smooth-threshold robust and efficient generalized estimating equations, which exploit the within-subject correlation structure and achieve robustness against outliers in both the response and the covariates by using the bounded exponential score function φh(r) and leverage-based weights. By introducing the additional tuning parameter h, it balances robustness and efficiency. Under mild conditions, we prove that, with probability tending to one, the method selects the relevant variables and identifies the partial linear structure correctly. Furthermore, the varying and nonzero constant coefficients can be estimated as accurately as if the true model structure and relevant variables were known in advance. Simulation studies also confirm our method.
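The abstract does not spell out the smooth-threshold construction, but one common device in this literature (following Ueki's smooth-threshold estimating equations, on which this line of work builds) weights each coefficient by δj = min(1, λ/|β̃j|^(1+τ)) computed from an initial estimate β̃, with δj = 1 shrinking coefficient j to zero. A minimal sketch under that assumption:

```python
import numpy as np

def smooth_threshold_weights(beta_init, lam, tau=1.0):
    """delta_j in [0, 1]: delta_j = min(1, lam / |beta_j|^(1+tau)).
    delta_j = 1 forces coefficient j to zero; large initial coefficients
    get delta_j near 0 and are left essentially unpenalized."""
    b = np.abs(np.asarray(beta_init, dtype=float))
    with np.errstate(divide="ignore"):        # |beta_j| = 0 gives inf, capped at 1 below
        delta = np.minimum(1.0, lam / b ** (1.0 + tau))
    return delta
```

Replacing the hard selection step by these smooth weights is what makes the selection and estimation solvable in one set of estimating equations, which matches the "one-step, probability tending to one" description above.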
Keywords/Search Tags: Semiparametric models, Variable selection, High dimensional data, Partial linear structure identification, Robust estimation, Longitudinal data, Missing data