
Two-Sample Test of Mean Vector with Dimension Reduction

Posted on: 2021-02-09    Degree: Master    Type: Thesis
Country: China    Candidate: H Y Zhao    Full Text: PDF
GTID: 2370330611997974    Subject: Probability theory and mathematical statistics
Abstract/Summary:
Modern society has entered the information age. The concept of "Big Data" emerged in fields such as physics, biology, and environmental ecology, and with the development of the Internet and the information industry it has attracted more and more attention in recent years. Data can now be produced on a larger scale and stored at a lower cost, which supports the development of the relevant theory and applications. With big data, people can monitor and analyze video more efficiently, and it also helps governments establish and manage vast fingerprint databases. In finance, researchers use big data to predict stock prices; in medicine, scientists perform genetic sequencing more efficiently and precisely, as they did for the coronavirus earlier this year. Big data can be applied in ever wider fields and indeed makes our lives more convenient. We produce large amounts of data every day, especially on social media such as WeChat and Weibo. These data reflect the different characteristics of individuals and make personalized services possible. By analyzing them, personalized services and personalized medicine can be provided, which is a clear improvement in the quality of life.

Although big data has many potential applications, it also brings corresponding challenges, and there is a genuine need to improve existing statistical techniques for high-dimensional data. One critical reason is that many traditional statistical methods have excellent properties in low dimensions yet fail when faced with high-dimensional data. The Hotelling T² test, for example, is a standard multivariate test for low-dimensional data but is not applicable to high-dimensional problems. The three basic elements of statistical analysis can be summarized as computational complexity, model interpretability, and mathematical accuracy. In traditional statistical research, the number of selected characteristic variables p is far smaller than the observed sample size n, so there is no need to sacrifice any of the elements described above. In the era of big data, however, the dimension of the data we face is much larger than before: microarray gene data, for example, typically involve arrays on the order of thousands, and the corresponding gene expression profiles are on the order of tens of thousands. The challenge is that the dimension grows faster than the sample size. Data analysis in the high-dimensional setting therefore has essential theoretical and practical significance.

Variable selection is one of the most critical parts of high-dimensional statistical inference. Collected data are often redundant for several reasons: features collected from multiple angles are correlated, noise can overwhelm the signal, and most samples are heterogeneous. Noise causes missing data, outliers, and heterogeneity, which bring obstacles and instability to statistical inference, so we want to extract the useful information from the collected data for further analysis. For high-dimensional statistical models, the principle of sparsity is widely adopted: although many feature dimensions are collected, only a few signals are truly useful. In this case, the parameter of the p-dimensional variable is assumed to be sparse; that is, most of its components are 0, and the remaining non-zero elements correspond to the characteristic variables. Under the assumption of sparsity, the characteristic factors can be screened out, so the accuracy of estimation and the efficiency of the test are significantly improved.
Meanwhile, it is worth noting that when sparsity holds, the computation time can also be greatly reduced. There are many classical variable selection methods, including AIC, based on information theory, and BIC, based on the Bayesian approach. In high dimensions, these traditional methods run into problems: the theoretical properties of the estimators may be wrong, or cannot be established, because the random error terms are neglected, and the large number of variables may make the computation prohibitively complex. In high-dimensional statistical inference, the classical variable selection methods are therefore no longer suitable.

In 1938, Wilks proved that the asymptotic distribution of the likelihood ratio statistic in parametric models is a standard chi-square distribution. Owen extended Wilks' theorem to nonparametric models and reached the same conclusion, proving that the asymptotic distribution of the empirical likelihood ratio statistic is also a standard chi-square distribution. The biggest advantage of the empirical likelihood method is that it requires no assumptions about the distribution of the samples while retaining some advantages of parametric methods, such as Wilks' theorem and the Bartlett correction. It is also worth mentioning that the empirical likelihood method does not need the variance of the estimator, so it avoids the complicated variance calculations of traditional parametric methods. When constructing a confidence region, the empirical likelihood method does not constrain the shape of the region or require a pivotal quantity. Since Owen proposed the empirical likelihood method, it has therefore been studied extensively. With the advent of the big data era, many statisticians have extended it to high-dimensional data and solved many statistical inference problems involving high-dimensional, large samples.

Recently, variable selection methods based on penalty functions have attracted a great deal of attention. The core idea is that an appropriate penalty function compresses small coefficients to zero while keeping large coefficients, so the important variables can be selected. Popular penalty functions include bridge regression (the Lq penalty), the Lasso (L1) penalty, the smoothly clipped absolute deviation (SCAD) penalty, the elastic-net penalty, the adaptive Lasso penalty, and the MCP. Many researchers have also combined penalty functions with the empirical likelihood method; Bartolucci proposed a penalized empirical likelihood and proved that the penalized empirical likelihood ratio still exhibits the Wilks phenomenon.

In this paper, we apply the empirical likelihood method to the two-sample means problem with growing dimensionality and study the case where the dimension p is greater than the sample size n. In many applications, the means of the two populations are typically either identical or differ only in a small number of coordinates; in other words, under the alternative hypothesis H1, the difference between the two means, μ1 − μ2, is sparse. It is therefore natural to take μ1 − μ2 to be sparse in our high-dimensional setting. It is also known that empirical likelihood approaches face difficulties when the model parameters are high-dimensional. We therefore draw on the idea of the generalized method of moments for variable selection and propose a new model to relax the limitation of high dimensionality.
We propose a new empirical likelihood method that penalizes the corresponding Lagrange multipliers in the optimization, using the SCAD penalty to select parameters. This penalized empirical likelihood framework relaxes the stringent requirement on the parameter dimensionality. We prove that, without affecting the validity and consistency of the estimator, the dimensionality can be effectively reduced by penalizing the Lagrange multipliers. Our theory states that the estimator from the new penalized empirical likelihood is sparse and consistent, with asymptotically chi-square distributed nonzero components. In the hypothesis testing part, the maximum marginal empirical likelihood ratio is used as the test statistic to test whether the two samples' means are equal. By choosing a suitable index set, namely the support set of the Lagrange multiplier, and finding the corresponding critical value, we obtain an efficient test.

In the numerical simulations, we study the performance of the proposed method under different covariance matrices and find that variable selection is performed well. Under the alternative hypothesis, the proposed method correctly rejects the null hypothesis even when only a small disturbance is applied, showing high sensitivity, and it continues to work well in the high-dimensional case with p > n. Finally, we use an acute lymphoblastic leukemia data set to test whether the mean B-cell gene expression is equal between patients of two molecular types; the results show that the means of the two groups differ significantly. The performance of the proposed method is thus illustrated through both numerical simulations and a real data example.
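For readers unfamiliar with the building block behind the marginal ratios above, the following sketch computes the classical one-sample empirical likelihood ratio for a scalar mean via its Lagrange dual, in the spirit of Owen's construction. It does not reproduce the thesis's penalized, multiplier-selected procedure; the simulated data and the helper name el_log_ratio are purely illustrative.

    import numpy as np
    from scipy.optimize import brentq

    def el_log_ratio(x, mu):
        """-2 log empirical likelihood ratio for a scalar mean mu.

        Dual form: find the multiplier lam solving sum z_i / (1 + lam*z_i) = 0,
        with z_i = x_i - mu; then -2 log R(mu) = 2 * sum log(1 + lam*z_i).
        """
        z = np.asarray(x, dtype=float) - mu
        if z.min() >= 0 or z.max() <= 0:
            return np.inf                    # mu lies outside the range of the data
        eps = 1e-10
        lo = -1.0 / z.max() + eps            # feasibility: 1 + lam*z_i > 0 for all i
        hi = -1.0 / z.min() - eps
        g = lambda lam: np.sum(z / (1.0 + lam * z))
        lam = brentq(g, lo, hi)              # g is strictly decreasing, so the root is unique
        return 2.0 * np.sum(np.log1p(lam * z))

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.0, scale=1.0, size=50)
    print(el_log_ratio(x, 0.0))   # small under the true mean; asymptotically chi-square(1)
    print(el_log_ratio(x, 0.8))   # large for a misspecified mean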
Keywords/Search Tags: empirical likelihood, two-sample means test, high-dimensional statistical methods, moment selection