Test On The Equality Of Two High-dimensional Correlation Matrices With Sparse Settings

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Yang
GTID: 2370330611497975
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the development of computer science, our ability to collect data keeps growing and the forms of data are becoming more and more diverse, which has led to an enormous increase in both the dimension and the size of data sets. As a result, a large number of high-dimensional statistical problems have arisen. Such high-dimensional data are increasingly available in genetics, finance and the Internet. For example, in the classification of proteins, we often sequence the gene pairs of proteins to distinguish different types of proteins. In practice, however, because gene sequencing is expensive, the sample size (n) is very small, while the number of gene pairs contained in each sample (p) is in the tens of thousands, which gives rise to a "small n, large p" problem. For such problems, classical methods tend to fail or to exhibit an inflated size. The Marčenko-Pastur distribution shows why traditional statistical methods often fail in this way: for high-dimensional data, the fluctuation of the eigenvalues of the sample covariance matrix deviates significantly from that of the eigenvalues of the population covariance matrix. Therefore, the sample covariance matrix is no longer a reliable estimate of the population covariance matrix and, naturally, the sample correlation matrix is no longer a reliable estimate of the population correlation matrix either. This fact explains the poor performance of many classical statistical methods on high-dimensional data: if the traditional methods are still applied to such data, we are very likely to commit a Type I error. Recently, proposing new statistical methods for high-dimensional data has become a main objective of modern statistics. Because the covariance matrix and the correlation matrix both play important roles in many statistical methods, the problem of the covariance matrix
and correlation matrix in high-dimensional statistical analysis always draws much attention. The Pearson covariance (hereinafter referred to as the "covariance") is often used to describe whether a linear relationship exists between two variables. Real data are often measured on different scales, for example height and age, or weight and the quantity of food intake. Data on different scales make the covariances between pairs of variables incomparable. Thus, the covariance between two variables often needs to be standardized, which unifies the scales of different kinds of data. This normalized covariance is called the Pearson correlation coefficient (hereinafter referred to as the "correlation coefficient"). The matrix composed of the covariances between different variables is called the covariance matrix, and the matrix composed of the correlation coefficients between different variables is called the correlation matrix. In statistical analysis, the equality of two covariance matrices or two correlation matrices receives great attention because many statistical methods assume that the covariance matrices or correlation matrices are equal. For example, Fisher’s linear discriminant analysis assumes that the covariance matrices of the two samples are equal. Therefore, in the analysis of high-dimensional data, it is often necessary to check whether the population covariance matrices or the population correlation matrices are equal; otherwise the statistical method may be difficult to apply. To test the equality of two covariance matrices or two correlation matrices, traditional methods often use the likelihood ratio test statistic, which was proposed by Kullback in 1969. This method performs well in the low-dimensional case but, as mentioned above, when the dimension p is very large relative to the sample size n, the sample covariance matrix is no longer a reliable estimate of the population one. Thus
the likelihood ratio test statistic leads to an inflated size. We therefore need alternative methods to replace the likelihood ratio test. Among the many high-dimensional statistical methods, one powerful tool is random matrix theory. In this theory, the asymptotic distribution of a class of test statistics is obtained by establishing the limiting distribution of the "linear spectral statistic". The advantage of this method is that many test statistics are special cases of the linear spectral statistic, so that once the central limit theorem for the linear spectral statistic is obtained, the limiting distributions of the specific statistics follow naturally. The disadvantage is that deriving the asymptotic behavior of the linear spectral statistic is often complicated, and sometimes it is even impossible to obtain the central limit theorem for the linear spectral statistic. In this paper, we focus on using statistical asymptotic methods to study the test for two-sample high-dimensional correlation matrices. Another powerful tool for this problem is to construct an extreme value statistic. Jiang (2004) constructed, for the first time, an extreme value statistic for the one-sample test of a high-dimensional correlation matrix and proved that this statistic converges to a Type-I extreme value distribution. Jiang’s work brought new insights into testing correlation matrices: extreme value theory can be introduced into the testing problem, and the problem of an extreme value statistic can be converted into a problem about sums of independent random variables. From there, Stein’s method can be applied to derive the limit theorem for the extreme value statistic. In 2013, Cai, Liu and Xia also proposed an extreme value test statistic Mn for the two-sample test of high-dimensional covariance matrices, and proved that the limiting
distribution of Mn is also a Type-I extreme value distribution under some sparse settings. Inspired by the work of Cai, Liu and Xia, Cai and Zhang proposed a similar test statistic Tn, based on the supremum norm, for the two-sample test of high-dimensional correlation matrices, and asserted that its limiting distribution is exactly the same as that of Mn. However, Cai and Zhang did not give a rigorous proof of this assertion. The covariance matrix and the correlation matrix are intrinsically different. For example, Kullback proposed a likelihood ratio statistic based on the covariance matrix and proved that it converges to a chi-square distribution; he also constructed a likelihood ratio statistic based on the correlation matrix, but that statistic is asymptotically distributed as a linear combination of chi-square distributions. Thus the two likelihood ratio statistics do not share the same distribution. So, although Tn and Mn are both based on the supremum norm, we still think it necessary to give a mathematical proof of the assertion. With this motivation, we rigorously prove Cai and Zhang’s conjecture: the limiting distribution of Tn is indeed a Type-I extreme value distribution of exactly the same form as in the case of Mn. The method of proof is similar to that of Cai, Liu and Xia. Because independence is not assumed in our problem, Stein’s method cannot be applied directly. We first prove the consistency of the normalizing part of Tn, so that the sample normalizing part can be replaced with its population counterpart. In this way, the denominator of Tn is simplified to its population form, which can be treated as a constant. Then we use a truncation method to prove that Tn can be approximated by its "noncentralized" form, which means that we can assume that the population means and population variances are known. Thus the
population means and variances can be used to replace the sample means and sample variances. Finally, using the sparsity assumptions and a generalized Bernstein inequality proposed in Zaitsev (1987), we obtain the limiting distribution of Tn. In the work of Cai, Liu and Xia, a fourth moment condition is needed. This assumption holds for all elliptical distributions, but we do not know whether it holds in more general cases. Therefore, we try to remove this distributional assumption. Hence, another contribution of our work is that we derive the limiting distribution of Tn without any distributional assumption, but under some alternative sparse settings. This part is based on the work of Xiao Han and Wei Biao Wu, who proposed an inequality to estimate the tail probability of the multivariate normal distribution. Based on this inequality, we further prove that the asymptotic distribution of Tn is truly a Type-I extreme value distribution under some sparsity assumptions but without the fourth moment condition. This generalization extends the applicability of the theorem. In summary, the main contribution of our work is a theoretical proof of the limiting distribution of Tn. Further, we introduce new sparsity conditions and establish the limit theorem in the distribution-free case, which helps to expand the range of application of the result. After that, we carry out statistical simulations for the normal distribution case and the gamma distribution case, respectively. The simulation results confirm our conclusion that the Type-I extreme value distribution is a good approximation to the distribution of Tn. Finally, we summarize the whole paper and outline future work: we want to obtain the convergence rate of the asymptotic behavior of Tn by Stein’s method, so that we can further propose a new limiting distribution of Tn with a faster rate of convergence.
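
As an illustration of the extreme value approach described above, the following is a minimal sketch and not the thesis's own code. It assumes that the covariance statistic Mn has the commonly cited form max over (i, j) of (s1_ij - s2_ij)^2 / (theta1_ij/n1 + theta2_ij/n2), and that Mn - 4 log p + log log p converges to the Type-I extreme value law with distribution function exp(-(8π)^(-1/2) exp(-t/2)). Both are recalled from the literature rather than stated in this abstract, and should be checked against the original papers. The function names max_type_statistic and gumbel_critical_value are hypothetical. The correlation statistic Tn would replace the sample covariances with sample correlations and use a matching variance estimate, which is not reproduced here.

```python
import numpy as np


def max_type_statistic(x, y):
    """Max-type statistic comparing the covariance matrices of two samples.

    x, y: arrays of shape (n1, p) and (n2, p). Hypothetical helper based on
    the assumed form of Mn given in the lead-in, not taken from the thesis.
    """
    def cov_and_theta(z):
        n = z.shape[0]
        zc = z - z.mean(axis=0)                  # center each column
        sigma = zc.T @ zc / n                    # p x p sample covariances
        # theta[i, j]: sample variance of the products zc[:, i] * zc[:, j]
        theta = (zc ** 2).T @ (zc ** 2) / n - sigma ** 2
        return sigma, theta, n

    s1, t1, n1 = cov_and_theta(np.asarray(x, dtype=float))
    s2, t2, n2 = cov_and_theta(np.asarray(y, dtype=float))
    # Entrywise standardized squared differences, then the maximum entry
    return float(np.max((s1 - s2) ** 2 / (t1 / n1 + t2 / n2)))


def gumbel_critical_value(p, alpha=0.05):
    """Level-alpha rejection threshold under the assumed limiting law."""
    # q_alpha solves exp(-(8*pi)**-0.5 * exp(-q/2)) = 1 - alpha
    q_alpha = -np.log(8 * np.pi) - 2 * np.log(np.log(1 / (1 - alpha)))
    return q_alpha + 4 * np.log(p) - np.log(np.log(p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((100, 50))           # both samples share the
    y = rng.standard_normal((100, 50))           # same covariance matrix
    m = max_type_statistic(x, y)
    print(m, m >= gumbel_critical_value(50))     # statistic, reject or not
```

Repeating the last few lines over many independent replications and recording the rejection frequency gives an empirical size that can be compared with the nominal level, which is the kind of check the simulations mentioned above perform.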
Keywords/Search Tags: high-dimensional statistics, correlation matrix, extreme value theory