| Count data are very common in daily life. They appear widely in medical experiments, transportation and economics: for example, the number of heartbeats of arrhythmia patients, the number of calls received by a call center within a given period, the number of customers entering a mall in one day, and the number of traffic accidents at a crossroad during a given period. This thesis uses ventricular contraction data from the same arrhythmia patients before and after the use of a new drug, which raises the problem of how to compare groups of correlated or independent count data. Comparing groups of count data has long been of interest. In this case, our goal is to test whether the new drug has a significant effect, so we need to compare the two groups of count data. Without loss of generality, we want to compare two or more groups of count data. These data can be correlated, most often as paired data, that is, records of the same batch of individuals or pre-selected subjects under different experimental methods, with the same amount of data in each group; they can also be independent, in which case the group sizes may differ. Research on groups of count data can help people make judgments and even decisions in real life on a sound theoretical basis, for example, to test whether a new drug has the expected effect, whether a monitoring device significantly reduces the number of traffic accidents at a crossroad, or whether a promotion significantly increases the number of customers entering a mall. Statistical significance can provide sufficient evidence to avoid misjudgment, rather than relying solely on subjective experience or intuition. Count data are usually modeled with a Poisson distribution, but the Poisson distribution requires the expectation of the variable to equal its variance, which often does not hold in real data. Many count data sets have a variance greater than
the expectation, which is called overdispersion. The negative binomial distribution handles overdispersion well because it contains a parameter that models the relationship between the variance and the expectation. In addition, there are other distributions for specific kinds of data. Consider the number of dentist visits: most people do not go to the dentist every year unless necessary, because it is expensive and troublesome, which leads to excess zeros in the data. The Poisson and negative binomial distributions cannot model such data well. Therefore we need to consider zero-inflated distributions, such as the zero-inflated Poisson distribution and the zero-inflated negative binomial distribution. These modified distributions tend to fit such data better than the original distributions. For a comparison between specific groups of count data, we could build a model tailored to that problem. However, such an approach is too time-consuming and laborious. Therefore, we mainly use regression analysis in this thesis, since it is very general and its results are intuitive and easy to understand. Statistical inference and hypothesis testing are also convenient. In this thesis, the group acts as an important factor in the regression analysis. For count data, Poisson regression and negative binomial regression are very common models. They can qualitatively analyze the influence of the group on the response variable. For zero-inflated data, we also establish zero-inflated Poisson regression and zero-inflated negative binomial regression and compare them with the simple regressions. We find that the zero-inflated models work better. In this thesis, we also describe how Poisson regression, negative binomial regression and the corresponding zero-inflated models use the EM algorithm or the Newton-Raphson algorithm to perform parameter estimation. The above generalized linear models fit the count data of independent groups well. For paired count data, we notice
that each subject may fluctuate slightly across multiple measurements for reasons of its own. For example, the measured number of heartbeats may fluctuate for reasons specific to the individual, so when modeling paired count data we must account for the fluctuations of the individual across multiple measurements; otherwise the regression results may contain large errors. Correlated groups of count data usually take the form of records for the same individuals under different experimental methods, for example, the number of heartbeats before and after taking the drug. We introduce random effects into the regression model to explain the fluctuations of each individual. We assume that the random effects are drawn independently from a normal distribution with mean zero and unknown variance. Given the random effects, we assume that the response variables are independent of each other and follow the Poisson distribution or the negative binomial distribution, and that their expectation is linked to the covariates and random effects through a link function. In this way, we fully account for the possible fluctuation of each individual. A generalized linear model (GLM) with random effects is called a generalized linear mixed model (GLMM). For parameter estimation in the GLMM, since the random effects cannot be observed and the likelihood function has no analytic expression, the parameters cannot be estimated directly by the EM algorithm or the Newton-Raphson algorithm. Therefore, in this thesis, the Monte Carlo method, combined with the EM algorithm and the Newton-Raphson algorithm, is adopted to estimate the parameters. As the random effects cannot be observed, we regard them as missing data in the EM algorithm. In the E step, we generate samples from their posterior distribution and calculate the value of the likelihood function. In the M step, we update the parameters. This iterative process continues until the predetermined convergence condition is
satisfied, and we obtain the maximum likelihood estimate (MLE) of the parameters. The expression for the posterior distribution of the random effects is very complicated, which makes it impossible to generate samples from it directly. Therefore, we need another sampling method. The acceptance-rejection method is a commonly used one: samples are drawn indirectly through another probability density that is easier to sample from, and the complete (normalized) target density need not be known. As used in this thesis, it is feasible in practice, although it is sometimes time-consuming. We first carry out some numerical experiments. The null hypothesis is that there is no significant difference between the groups, while the alternative hypothesis is that there is a difference. The data are generated by drawing samples from the Poisson distribution and adding random effects from the normal distribution to them. GLMs without random effects and GLMMs are used to model the simulated data. We then compare the type I error and the type II error of the different models, where the type II error is assessed through the power. We find that models with random effects perform better in both the type I error and the type II error. In addition, these models are consistent with each other with respect to the significance of the group. This shows that it is reasonable and necessary to add random effects to the model when modeling paired count data. On several real datasets, we compare traditional generalized linear models, including Poisson regression and negative binomial regression, with generalized linear regression models with random effects. We find that the different models are consistent with respect to the significance of the group. The GLMM can also identify the variance of the random effects. This shows that in real situations, the random effects caused by the same individual’s fluctuations do exist. In groups of
correlated count data, especially paired data, the random effects should be considered. Some of these real datasets are zero-inflated, so we establish Poisson regression, negative binomial regression, zero-inflated Poisson regression and zero-inflated negative binomial regression, and compare the results of the different regression models. The results show that for data with excess zeros, the zero-inflated models are much better than the simple regression models. To compare different models, we mainly use information criteria, which are based on the value of the likelihood function with a penalty for the number of samples and parameters. Finally, we summarize the thesis and give some discussion, mainly about the random effects. First, in this thesis we assume that the random effects of different individuals are independent and identically distributed from a normal distribution with only one parameter to be estimated, namely its standard deviation. We could also assume that the random effects of different individuals come from normal distributions with different variances, which can then be estimated. Similarly, we could assume that the random effects follow other distributions; the gamma distribution is one choice. Second, in this thesis we assume that different individuals are independent of each other, that is, their random effects are independent. In real situations, there may be some correlation between different individuals. Therefore, we can assume that the random effects are related to each other, for example through a multivariate normal distribution or another similar multivariate distribution. By introducing correlated random effects, we can model the relationship between different subjects and obtain better and more reasonable results. |
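
The acceptance-rejection step described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis; it assumes a Poisson response with a log link, so that for one subject the counts y_j have mean mu_j * exp(b), where mu_j = exp(x_j' beta) is the fixed-effect part and b is the subject's random effect with prior N(0, sigma^2). It also assumes that the proposal density is the prior itself, which the abstract does not specify. The function names `log_lik_kernel` and `sample_random_effect` are hypothetical.

```python
import math
import random


def log_lik_kernel(b, y_sum, mu_sum):
    """Poisson log-likelihood as a function of b, up to terms free of b."""
    return y_sum * b - mu_sum * math.exp(b)


def sample_random_effect(y, mu, sigma, n_samples, rng):
    """Draw b from p(b | y), proportional to N(b; 0, sigma^2) * L(b).

    Assumed setup (not taken from the thesis): the proposal is the
    N(0, sigma^2) prior, and a candidate b is accepted with probability
    L(b) / sup_b L(b). Only likelihood ratios are used, so the
    normalizing constant of the posterior is never needed.
    """
    y_sum, mu_sum = sum(y), sum(mu)
    if y_sum > 0:
        b_hat = math.log(y_sum / mu_sum)  # maximizer of L(b)
        log_sup = log_lik_kernel(b_hat, y_sum, mu_sum)
    else:
        log_sup = 0.0  # all counts zero: the kernel tends to 0 as b -> -inf
    draws = []
    while len(draws) < n_samples:
        b = rng.gauss(0.0, sigma)  # candidate from the prior
        u = 1.0 - rng.random()     # uniform on (0, 1], keeps log(u) finite
        if math.log(u) <= log_lik_kernel(b, y_sum, mu_sum) - log_sup:
            draws.append(b)
    return draws


# Example: one subject measured twice (e.g. before and after treatment).
rng = random.Random(0)
draws = sample_random_effect(y=[3, 5], mu=[2.0, 2.0], sigma=0.5,
                             n_samples=1000, rng=rng)
print(sum(draws) / len(draws))  # Monte Carlo estimate of E[b | y]
```

In a Monte Carlo EM iteration, draws of this kind would be averaged in the E step before the parameters are updated in the M step. The acceptance rate falls when the likelihood is sharply peaked far from the prior mean, which is consistent with the remark that the method can be time-consuming.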