Comparisons Of Multiple Two-way Contingency Tables With Dependent Structure

Posted on: 2021-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: J X Qiu
GTID: 2370330611997973
Subject: Probability theory and mathematical statistics
Abstract/Summary:
A contingency table is a common data format in which each cell records the frequency of observations classified by two or more attributes. Contingency tables are widely used in disciplines such as medicine, biology, and the social sciences. Statistical analysis of a contingency table can be used to investigate whether two attribute variables are associated, that is, whether they are independent. There are various statistical tests for the dependency between the two variables of a contingency table, such as Pearson's chi-squared test, Fisher's exact test [2], and the Cochran-Mantel-Haenszel test [1]. These tests assess whether there is sufficient evidence to reject the null hypothesis that the variables are independent of each other.

The purpose of this thesis is to develop a model that can be used both to construct tests of independence and to measure the association between two random variables. The main idea is to propose a bivariate binomial distribution by introducing a dependency parameter ω that measures the correlation between the variables, and then to apply the model to the statistical analysis of two-way contingency tables.

In the introduction, we give a detailed literature review of bivariate binomial distributions. Suppose that (X, Y) follows a bivariate binomial distribution with binomial marginals, that is, X ~ Binomial(n1, θ1) and Y ~ Binomial(n2, θ2). The existing bivariate binomial distributions in the literature can be roughly divided into three categories. The first requires n1 = n2. The second requires θ1 = θ2. The third places no restriction on n1, n2, θ1, or θ2, and therefore has the widest range of application. This thesis constructs a new bivariate binomial distribution based on the idea of Sarmanov (1966) by introducing a dependency parameter ω. The distribution belongs to the third category. We give explicit forms of the mean, covariance, and correlation, as well as the admissible range of the dependency parameter ω. The random variables X and Y are positively (negatively) correlated when ω is positive (negative).

We then consider parameter estimation and hypothesis testing. We estimate the parameters from the likelihood function, but the MLE has no closed-form solution, so iterative algorithms such as gradient descent or the Fisher scoring algorithm are used. We also apply the bootstrap to estimate standard errors and to provide bootstrap confidence intervals for the parameters. If the contingency table contains missing data, the EM algorithm can be used for estimation. The simulation studies require a specific sampling method. Since both marginal distributions of the model are known simple distributions, namely binomial distributions, we can use the conditional sampling method to generate random samples. The simulation studies show that the accuracy of the parameter estimates improves as the sample size increases.

By construction, the parameter ω measures the correlation between the variables X and Y in a two-way contingency table; in particular, ω = 0 implies that the correlation between X and Y is zero. Therefore, we can use the proposed bivariate distribution to model the contingency table and use the parameter ω to test independence. In fact, a contingency table can be modelled by a bivariate Bernoulli distribution, which is the case n1 = n2 = 1. In this thesis, however, we consider the more general case in which n1 and n2 can be any positive integers. We use three likelihood-based asymptotic tests, the likelihood ratio test, the Wald test, and the score test, and construct the corresponding test statistics. All of these statistics asymptotically follow the χ²(1) distribution under the null hypothesis. For small samples, we introduce the bootstrap into the hypothesis tests to reduce the type I error rate. The simulation studies show that, when the sample size is large, the type I error rate of each test fluctuates around the pre-determined significance level, and the likelihood ratio test performs better than the Wald test and the score test. If n1 = n2, the result of the Wald test is the same as that of the score test. If the sample size is small, the bootstrap-based testing method significantly reduces the type I error rate of each test, which shows that the resampling method is feasible and necessary.

It is worth mentioning that the odds ratio can also be used to test independence between two random variables in a contingency table: X and Y are independent if the odds ratio equals one. However, the test of independence based on the odds ratio has defects. When the true values of the parameters θ1 and θ2 are close to 0 or 1, that test fails, whereas the three proposed tests remain applicable. The random variables X and Y are positively (negatively) correlated when the odds ratio is greater (less) than 1. Similarly, we can use the sign of the dependency parameter ω to determine the direction of the correlation. We also use simulation to compare the performance of the test based on the parameter ω with that of the chi-squared test. The results show that the proposed testing methods are more powerful than the chi-squared test. Therefore, the proposed testing methods are superior to both the chi-squared test and the test based on the odds ratio.

At the end of the thesis, we consider the multiple comparison problem for tests of independence. For K two-way contingency tables, we wish to test whether X and Y are independent in every table. For a single contingency table, we can use the proposed testing methods based on the parameter ω. Let ωk denote the dependency parameter of the k-th contingency table. The hypothesis testing problem of interest can then be expressed as

H0: ω1 = ω2 = ··· = ωK = 0 vs. H1: ωi ≠ 0 for some 1 ≤ i ≤ K.

It can be divided into K single tests "H0k: ωk = 0 vs. H1k: ωk ≠ 0". The null hypothesis H0 is rejected if at least one H0k is rejected. Since a type I error can occur in each single test, every additional test increases the overall type I error rate of the global test. Hence, it is necessary to adjust the p-values. Adjusting the p-value is equivalent to correcting the significance level: if the significance level of the testing problem is set to α and the p-value obtained from the data is p, then multiplying the p-value by K is equivalent to setting the significance level to α/K. Both corrections make the test more stringent and thereby reduce the type I error rate. In practice, it is more common to adjust the p-value than the significance level α, and statistical software also typically handles the multiple comparison problem by adjusting the p-value. Therefore, we adopt p-value adjustment in this thesis.

The adjustment of p-values in a multiple comparison problem takes one of two approaches: controlling the family-wise error rate (FWER) or controlling the false discovery rate (FDR). FWER is a classic measure for controlling the error rate of a multiple testing problem. It is defined as the probability of making at least one type I error. The simplest but most conservative method of controlling the FWER is the Bonferroni method. If the multiple testing problem contains K separate hypothesis tests, the Bonferroni method multiplies each p-value by K, and the decision to reject each null hypothesis is then based on the adjusted p-values. When K is relatively large, the Bonferroni method greatly reduces the type I error rate but at the same time greatly increases the type II error rate, i.e., it produces many false negatives. Other commonly used methods of controlling the FWER include the Holm, Hochberg, and Hommel methods. These methods are also relatively conservative: their power decreases as the number of tests increases, so they are not suitable for multiple comparisons on massive data. Other methods, such as the minP and maxT methods proposed by Westfall and Young in 1993, can also control the FWER. These two methods are p-value adjustment methods based on resampling techniques.

Another commonly used measure for controlling the error rate of multiple tests is the FDR. In 1995, Benjamini and Hochberg first proposed the concept of FDR and gave a method for controlling it in multiple tests. The FDR is defined as the expected proportion of true null hypotheses among all rejected null hypotheses. The FDR received little attention when it was first proposed, but the emergence of massive data has since made it widely used. Commonly used methods of controlling the FDR are the BH (Benjamini-Hochberg) method and the BY (Benjamini-Yekutieli) method.

We introduce these commonly used p-value adjustment methods in detail and apply them to the multiple comparison of independence between two random variables in K two-way contingency tables. Based on the resampling technique, the corresponding minP and maxT methods are constructed. We compare and analyze the advantages and disadvantages of each method and carry out a simulation study for the case K = 2. The simulation results show that these methods can effectively control the type I error rate of multiple testing problems.
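The p-value adjustments described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names and the example p-values are hypothetical, and the code only shows the standard Bonferroni, Holm, and Benjamini-Hochberg adjustments applied to K single-test p-values, followed by the rule that the global null hypothesis is rejected when at least one adjusted p-value falls at or below the significance level.

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by K and cap at 1 (controls FWER)."""
    k = len(pvals)
    return [min(1.0, p * k) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment (controls FWER)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is multiplied by (K - rank);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (k - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls FDR)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * k
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = k - pos  # 1-based rank of pvals[i] in ascending order
        running_min = min(running_min, pvals[i] * k / rank)
        adjusted[i] = running_min
    return adjusted


def reject_global_null(adjusted, alpha=0.05):
    """Reject H0 if at least one single test is rejected after adjustment."""
    return any(p <= alpha for p in adjusted)


if __name__ == "__main__":
    # Hypothetical p-values from K = 3 single tests of H0k (illustration only).
    pvals = [0.01, 0.04, 0.03]
    for name, method in [("Bonferroni", bonferroni),
                         ("Holm", holm),
                         ("BH", benjamini_hochberg)]:
        adj = method(pvals)
        print(name, adj, reject_global_null(adj))
```

The resampling-based minP and maxT methods are not covered by this sketch, since they require the joint distribution of the test statistics under the null hypothesis.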
Keywords/Search Tags: bivariate binomial distribution, bootstrap, dependency, multiple comparison