Marginalized Two-part Beta Regression Models For Microbiome Data And Related Topics

Posted on: 2019-07-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H T Chai
GTID: 1360330572453607
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This thesis concerns the analysis of zero-inflated data with two-part models. A response variable is zero-inflated when it contains so many zeros that no standard distribution fits it. For example, let Y be the number of emergency department visits made by patients in a particular region, so that Y = 0 means a patient made no visits. Most patients make no emergency department visits, so Y has many zero values. Such a count variable is usually described by a Poisson distribution whose mean parameter is estimated by the sample mean. However, the number of zeros expected under the fitted Poisson distribution is much smaller than the number observed in the real data. Such a variable Y is called zero-inflated. Zero-inflated data are frequently encountered in biomedical, health service, economic, and ecological studies, and they are receiving increasing attention. Related research has been published in leading journals such as Journal of the American Statistical Association, Nature, Nature Communications, Cell Host & Microbe, Bioinformatics, and Statistical Methods in Medical Research.

In the example above, the positive part of Y is discrete count data. In many studies, however, the positive part of the variable of interest is continuous. For example, in an alcohol dependence study, let Y denote the volume of daily drinks. Then Y ranges over $[0, +\infty)$ and its positive part is continuous. Data of this kind are called semicontinuous. Another kind of semicontinuous data ranges over $[0, 1)$ instead of $[0, +\infty)$; usually this is the relative abundance of a component in a sample. Such data are called compositional data and are the topic of this thesis.

From the discussion above, the distribution of zero-inflated data Y can be described as

$$Y \sim \begin{cases} 0, & \text{with probability } 1 - p, \\ f(y; \theta), & \text{with probability } p, \end{cases} \tag{1}$$

where $f(y; \theta)$ is the distribution of the positive part and $\theta$ is its parameter. To study the relationship between the response Y and covariates x, the two-part regression model has been proposed. As its name indicates, a two-part model consists of two sub-models. The first sub-model regresses the probability p on the covariates with a logistic regression model, and the second sub-model regresses the positive response on the covariates with a generalized linear model. Let $\mu = E(Y \mid Y > 0)$ be the conditional mean given that Y is positive. A two-part model then has the form

$$\begin{aligned} \operatorname{logit}(p_i) &= x_i^T \alpha = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_p x_{ip}, \\ g(\mu_i) &= x_i^T \beta = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \quad i = 1, \dots, n, \end{aligned} \tag{2}$$

where $g(\cdot)$ is the link function. Model (2) is called the conditional two-part model because the second sub-model uses the conditional mean. It is worth noting that the parameters $\beta$ in model (2) represent the effects of the covariates on the conditional mean of Y. In many applications, however, the quantity of interest is the effect of the covariates on the unconditional mean of Y. For this purpose, a marginalized two-part model has been proposed. Briefly, in the marginalized two-part model, the overall mean $\nu = E(Y)$ is regressed on the covariates instead of the conditional mean $E(Y \mid Y > 0)$:

$$\begin{aligned} \operatorname{logit}(p_i) &= x_i^T \alpha = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_p x_{ip}, \\ g(\nu_i) &= x_i^T \gamma = \gamma_0 + \gamma_1 x_{i1} + \dots + \gamma_p x_{ip}, \quad i = 1, \dots, n. \end{aligned} \tag{3}$$

The marginalized model (3) is easier to interpret than the conditional model (2) when describing the relationship between the overall mean and the covariates.

The thesis has five chapters and an appendix. The main results of the first four chapters are summarized below.

Chapter 1: This chapter introduces the basics of two-part models. The introduction begins with real data sets containing zero-inflated count data and zero-inflated semicontinuous data, and then two-part regression models are used to examine the effects of the covariates on the response. For zero-inflated count data, for example, there are the hurdle model, the zero-inflated Poisson model, the zero-inflated negative binomial model, and so on. For the positive part of zero-inflated semicontinuous data, the log-normal distribution is the most widely used. Some useful extensions of the conditional two-part model are then introduced. For example, the marginalized two-part model can be used to examine the effects of covariates on the overall mean of the response, heterogeneity can be incorporated by regressing the variance on the covariates, and random effects can be introduced to describe the correlations between longitudinal responses. These extensions are the main content of the next three chapters. Finally, we show how to estimate the parameters of two-part models.

Chapter 2: The second chapter studies a special kind of semicontinuous data, namely compositional data, which range over $[0, 1)$. Medical research has found that many diseases are related to the abundances of microbes in the human body. To find the cause of a disease, researchers need to compare the abundances of microbes in different groups, such as an experimental group and a control group. Abundances are originally measured as counts. However, because the total sequence counts differ across samples, abundances measured in read counts are not comparable across samples. The counts are therefore commonly normalized to relative abundances by dividing by the total sequence count of the sample. This results in compositional data with many zero values. Let $Y_i$ ($0 \le Y_i < 1$) denote the abundance of a microbe in the i-th sample, $i = 1, 2, \dots, n$. The following two-part model is commonly used to describe the distribution of $Y_i$:

$$Y_i \sim \begin{cases} 0, & \text{with probability } 1 - p_i, \\ \mathrm{Beta}(\mu_i \phi, (1 - \mu_i)\phi), & \text{with probability } p_i, \end{cases} \tag{4}$$

where $p_i$ is the probability of a positive value and $\mathrm{Beta}(\mu_i \phi, (1 - \mu_i)\phi)$ is the Beta distribution with mean parameter $\mu_i$ ($0 < \mu_i < 1$) and dispersion parameter $\phi$ ($\phi > 0$). Here $\mu_i$ is the conditional mean of $Y_i$ given that $Y_i$ is positive. To examine the effects of covariates $X_i$ on the response $Y_i$, the conditional two-part model regresses $p_i$ and $\mu_i$ on the covariates with generalized linear models:

$$\begin{aligned} \operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = X_i^T \alpha, \\ \operatorname{logit}(\mu_i) &= \log\frac{\mu_i}{1 - \mu_i} = X_i^T \beta. \end{aligned} \tag{5}$$

However, in many applications the interest lies in the relationship between the covariates and the overall mean $\nu_i = E(Y_i)$. We therefore propose the following marginalized two-part model:

$$\begin{aligned} \operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = X_i^T \alpha, \\ \operatorname{logit}(\nu_i) &= \log\frac{\nu_i}{1 - \nu_i} = X_i^T \gamma. \end{aligned} \tag{6}$$

Under the conditional model, whether the overall mean depends on a continuous covariate $x_{ij}$ is tested through the hypothesis $\alpha_j = 0$, $\beta_j = 0$. However, independence of the overall mean from $x_{ij}$ is not equivalent to this hypothesis, because

$$\frac{\partial}{\partial x_{ij}} \operatorname{logit}[E(Y_i)] = c_1(\alpha_j, \beta_j)\,\alpha_j + c_2(\alpha_j, \beta_j)\,\beta_j. \tag{7}$$

Under the marginalized two-part model, whether the overall mean depends on a continuous covariate $x_{ij}$ is determined by the single coefficient $\gamma_j$:

$$\frac{\partial}{\partial x_{ij}} \operatorname{logit}[E(Y_i)] = \gamma_j. \tag{8}$$

The results for a discrete covariate are similar. Equation (7) shows that the conditional model sometimes cannot control the type I error, and this is verified by simulation studies. The simulation results show that both the conditional model and the marginalized model are powerful. However, the conditional model sometimes fails to control the type I error, whereas the marginalized model controls it well throughout. In addition, the marginalized model performs well in estimation and is robust. Finally, the proposed model is applied to a real data set, and the results show that it performs better than the conditional model.

Chapter 3: Models (5) and (6) assume that the dispersion parameter $\phi$ is constant. However, we find that this assumption does not hold in the real data. For example, for the real data in Chapter 2, the dispersion parameters of the 131 operational taxonomic units (OTUs) range from 0.67 to 1219.35. This motivates us to incorporate heterogeneity into the two-part model. Among the many ways of incorporating heterogeneity, this chapter regresses the dispersion parameter on the covariates directly. This gives the conditional two-part model with heterogeneity,

$$\begin{aligned} \operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = X_i^T \alpha, \\ \operatorname{logit}(\mu_i) &= \log\frac{\mu_i}{1 - \mu_i} = X_i^T \beta, \\ \phi_i &= \exp(X_i^T \delta), \end{aligned} \tag{9}$$

and the marginalized two-part model with heterogeneity,

$$\begin{aligned} \operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = X_i^T \alpha, \\ \operatorname{logit}(\nu_i) &= \log\frac{\nu_i}{1 - \nu_i} = X_i^T \gamma, \\ \phi_i &= \exp(X_i^T \delta). \end{aligned} \tag{10}$$

The performance of the models with heterogeneity is then evaluated by simulation studies. The results show that the models with heterogeneity perform well for data with or without heterogeneity, whereas the models without heterogeneity perform poorly for data with heterogeneity. Comparing the conditional and marginalized models, only the marginalized model can successfully examine the relationship between the overall mean and the covariates. Finally, the new models are applied to the real data of Chapter 2 and give more precise results. The results of this chapter show that the marginalized two-part model with heterogeneity is the proper choice when it is not known whether the real data are heterogeneous.

Chapter 4: This chapter focuses on longitudinal data, in which the responses are correlated. Random effects are incorporated to model the correlations, which results in two-part models with random effects. For the conditional model, random effects can be incorporated as follows:

$$\begin{aligned} \operatorname{logit}(p_{ij}) &= \log\frac{p_{ij}}{1 - p_{ij}} = x_{ij}^T \alpha + a_i, \\ \operatorname{logit}(\mu_{ij}) &= \log\frac{\mu_{ij}}{1 - \mu_{ij}} = x_{ij}^T \beta + b_i, \end{aligned} \tag{11}$$

where $a_i$ and $b_i$ are random intercepts with distributions $a_i \sim N(0, \sigma_a^2)$ and $b_i \sim N(0, \sigma_b^2)$. Similarly, the marginalized two-part model with random effects is

$$\begin{aligned} \operatorname{logit}(p_{ij}) &= \log\frac{p_{ij}}{1 - p_{ij}} = x_{ij}^T \alpha + c_i, \\ \operatorname{logit}(\nu_{ij}) &= \log\frac{\nu_{ij}}{1 - \nu_{ij}} = x_{ij}^T \gamma + d_i, \end{aligned} \tag{12}$$

where $c_i$ and $d_i$ are random intercepts with distributions $c_i \sim N(0, \sigma_c^2)$ and $d_i \sim N(0, \sigma_d^2)$. In the conditional model with random effects, the effect of a continuous covariate $x_{ijl}$ on the overall mean can be described as

$$\frac{\partial E(Y_{ij})}{\partial x_{ijl}} = \alpha_l \cdot \psi_1(\alpha, \beta) + \beta_l \cdot \psi_2(\alpha, \beta). \tag{13}$$

This shows that independence between $E(Y_{ij})$ and $x_{ijl}$ is not equivalent to $\alpha_l = 0$, $\beta_l = 0$. In the marginalized model with random effects, we have

$$\frac{\partial E(Y_{ij})}{\partial x_{ijl}} = \gamma_l \cdot \psi_1(\gamma). \tag{14}$$

Equation (14) shows that the effect of the covariate $x_{ijl}$ on the overall mean is determined by the coefficient $\gamma_l$. The results for a discrete covariate are similar. Both the theoretical results and the simulations show that the marginalized model with random effects can examine the relationship between the covariates and the overall mean. Finally, the proposed model is applied to a real data set to evaluate its performance.
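The contrast between the conditional and the marginalized two-part Beta model (without random effects) can be illustrated numerically. The sketch below is not the thesis's code: the function names, the single-covariate setup, and all coefficient values are hypothetical, and the marginal-mean coefficients are called `gamma` here purely for illustration. It relies only on the identity E(Y) = p * mu, which follows from the two-part distribution, so that under the marginalized model the Beta mean is recovered as mu = nu / p.

```python
import math
import random


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(q):
    """Log-odds of a probability q in (0, 1)."""
    return math.log(q / (1.0 - q))


def conditional_overall_mean(x, alpha, beta):
    """E(Y) = p * mu when logit(p) = a0 + a1*x and logit(mu) = b0 + b1*x."""
    p = expit(alpha[0] + alpha[1] * x)
    mu = expit(beta[0] + beta[1] * x)
    return p * mu


def marginalized_parts(x, alpha, gamma):
    """Return (p, nu, mu) when logit(p) = a0 + a1*x and logit(nu) = g0 + g1*x.

    The conditional Beta mean is implied by the identity nu = p * mu.
    """
    p = expit(alpha[0] + alpha[1] * x)
    nu = expit(gamma[0] + gamma[1] * x)
    mu = nu / p
    if not 0.0 < mu < 1.0:
        raise ValueError("nu / p must lie in (0, 1) to be a valid Beta mean")
    return p, nu, mu


def draw_zero_inflated_beta(p, mu, phi, rng):
    """One draw: 0 with probability 1 - p, else Beta(mu*phi, (1 - mu)*phi)."""
    if rng.random() >= p:
        return 0.0
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)


def slope_of_logit_mean(mean_fn, x, h=1e-5):
    """Central-difference derivative of logit(E(Y)) with respect to x."""
    return (logit(mean_fn(x + h)) - logit(mean_fn(x - h))) / (2.0 * h)


# Hypothetical coefficients (intercept, slope) chosen only for illustration.
ALPHA = (1.0, 0.5)    # zero part, shared by both models
BETA = (-1.0, -0.3)   # conditional-mean part of the conditional model
GAMMA = (-2.0, 0.8)   # overall-mean part of the marginalized model

if __name__ == "__main__":
    x0 = 0.3
    cond = slope_of_logit_mean(
        lambda x: conditional_overall_mean(x, ALPHA, BETA), x0)
    marg = slope_of_logit_mean(
        lambda x: marginalized_parts(x, ALPHA, GAMMA)[1], x0)
    print("conditional model, d logit E(Y) / dx:", cond)
    print("marginalized model, d logit E(Y) / dx:", marg)
```

In the marginalized case the slope of logit E(Y) equals the overall-mean coefficient by construction. In the conditional case it is a mixture of both slopes, [(1 - p) * a1 + (1 - mu) * b1] / (1 - p * mu), so a covariate can leave the overall mean nearly unchanged even when both coefficients are nonzero.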
Keywords: two-part model, zero-inflated data, semicontinuous, compositional data, Beta regression, marginalized model, heterogeneous, longitudinal data, random effects