Nonparametric Bayes models for high-dimensional and sparse data

Posted on: 2011-06-13
Degree: Ph.D
Type: Dissertation
University: Duke University
Candidate: Yang, Hongxia
Full Text: PDF
GTID: 1440390002950276
Subject: Statistics
Abstract/Summary:
Chapter 1. We review the Dirichlet process (DP) in detail. There are many other approaches to nonparametric modeling, but because of its efficient computation and well-developed theory, the DP is the most popular and has been developed and studied extensively. We also review the most recent developments of the DP in this chapter.

Chapter 2. We propose the multiple Bayesian elastic net (MBEN), a new regularization and variable selection method. High dimensional and highly correlated data are commonplace. In such situations, maximum likelihood procedures typically fail: their estimates are unstable and have large variance. To address this problem, a number of shrinkage methods have been proposed, including ridge regression, the lasso and the elastic net; these methods encourage coefficients to be near zero (in fact, the lasso and the elastic net perform variable selection by forcing some regression coefficients to equal zero). In this paper we describe a semiparametric approach that allows shrinkage to multiple locations, where the location and scale parameters are assigned Dirichlet process hyperpriors. The MBEN prior encourages variables to cluster, so that strongly correlated predictors tend to be in or out of the model together. We apply the MBEN prior to a multi-task learning (MTL) problem on text data from Wikipedia, using shared words to predict article links. An efficient MCMC algorithm and an automated Monte Carlo EM algorithm enable fast computation in high dimensions.

Chapter 3. Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects.
Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.

Chapter 4. In studies involving multi-level data structures, problems of data sparsity are often encountered and it becomes necessary to borrow information to improve inferences and predictions. This article is motivated by studies collecting data on different outcomes following congenital heart surgery. If there were sufficient numbers of patients receiving each type of procedure, one could potentially fit a procedure-specific multivariate random effects model to relate the outcomes of surgery to patient predictors while allowing variability among hospitals. However, as there are approximately 150 procedures, many of them conducted on few patients, it is important to borrow information. Allowing variability among hospitals, procedures and outcome types in the regression coefficients relating patient factors to outcomes, we obtain a three-way tensor of regression coefficient vectors. To borrow information in estimating these coefficients, we propose a Bayesian multiway tensor co-clustering model.
In particular, the model works by reducing the dimension of the table through separately clustering hospitals, procedures and outcome types. This soft probabilistic clustering proceeds via nonparametric Bayesian latent class models, which favor clustering of dimensions that have similar values for feature vectors. Efficient MCMC and fast approximation approaches are proposed for posterior computation. The methods are illustrated using simulated data, and applied to heart surgery outcome data from a Duke study. (Abstract shortened by UMI.)
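
As a rough illustration of the Dirichlet process reviewed in Chapter 1, which also underlies the priors in the later chapters, the sketch below draws mixture weights from the standard truncated stick-breaking construction. It is not taken from the dissertation: the function name `stick_breaking_weights`, the concentration value `alpha=2.0`, and the truncation level of 50 atoms are all assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, num_atoms, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha.

    Returns the first num_atoms weights and the stick length left over
    after truncation. (Hypothetical helper, not from the dissertation.)
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any atom
    for _ in range(num_atoms):
        fraction = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(remaining * fraction)    # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, leftover = stick_breaking_weights(alpha=2.0, num_atoms=50, rng=rng)
print(f"mass in the 5 largest atoms: {sum(sorted(weights, reverse=True)[:5]):.3f}")
print(f"mass lost to truncation: {leftover:.2e}")
```

Smaller values of `alpha` concentrate the mass on a few atoms, which is the behavior that lets DP-based priors cluster coefficients or latent classes; larger values spread the mass over many atoms.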
Keywords/Search Tags:Data, Nonparametric, Model, Efficient MCMC, Chapter, Using