
Hypothesis Tests For High-dimensional Data And Upper Expectation Parametric Regression

Posted on: 2019-03-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P Dong    Full Text: PDF
GTID: 1360330572456652    Subject: Probability theory and mathematical statistics

Abstract/Summary:
In recent decades, science and technology have developed rapidly in fields such as the biological sciences, medicine, information technology, and finance, and practical problems in these fields involve many kinds of datasets. In the face of such complicated datasets, the role of statistics is particularly prominent. In this dissertation, we mainly study hypothesis tests in the high-dimensional setting and the analysis of datasets with distribution randomness. For hypothesis testing, we consider the significance test of clustering and the two-sample test of means in the high-dimensional setting. For datasets with distribution randomness, we first define distribution randomness and then construct the upper expectation linear regression model; for this model, we propose a two-step penalized maximum least squares procedure to estimate the mean function and the upper expectation of the error. These two kinds of problems share a common feature: the samples in a dataset may come from different distributions.

The dissertation is divided into four chapters. Chapter one briefly introduces several classical clustering methods, three significance tests, two types of mean-test methods, the linear regression model, and the definition of upper expectation; the structure of the dissertation is outlined at the end of the chapter. Chapters two and three are devoted to hypothesis tests for high-dimensional datasets: chapter two presents a new significance test of clustering (SigClust) with applications to cancer data, and chapter three develops Neyman's Truncation test for two-sample means with applications to leukemia data. Chapter four studies the phenomenon of distribution randomness and then proposes an upper expectation regression with the corresponding estimation procedures. We now introduce the last three chapters.

Chapter two: We study the significance test of clustering for high-dimensional datasets. Faced with large amounts of data, people naturally try to find patterns and summarize them. This calls for clustering methods, such as K-means clustering, which is based on squared distances, and hierarchical clustering, which is based on a dendrogram. Although a myriad of clustering techniques exist, relatively little is known about how to ascertain whether empirically determined clusters reflect real structure or mere chance, especially for datasets in the high-dimensional setting. Though often neglected, this check is an important step before a clustering algorithm is applied in practice. This part of the dissertation asks whether a given high-dimensional dataset has significant subgroups.

The chapter first presents an instructive example. We generate a sample of size n from the normal distribution N(0,1) and separate it into two extreme parts, putting the largest [n/2] observations in one part and the remaining ones in the other. We then apply the common t-statistic to test the difference between the two parts. The p-value turns out to be extremely small, which leads to rejecting the null hypothesis that the data come from the same population, contradicting the truth. This indicates that mean testing is not suitable for the significance test of clustering.
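The example is easy to reproduce. The following is a minimal sketch (the sample size, the seed, and the use of NumPy/SciPy are illustrative assumptions, not taken from the dissertation):

```python
# Split one N(0,1) sample into its smaller and larger halves, then
# (mis)apply a two-sample t-test to the artificial "clusters".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.standard_normal(n))      # one sample from N(0, 1)
lower, upper = x[: n // 2], x[n // 2 :]  # two extreme parts

t_stat, p_value = stats.ttest_ind(upper, lower)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# The p-value is essentially zero even though both halves come from
# the same population: the tiny p-value reflects the data-driven
# split, not genuine cluster structure.
```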
Significance test methods for clustering are currently lacking in statistics. Liu et al. (2008) [47] proposed the SigClust test, but its empirical size is so conservative that the size is over-controlled, which causes a loss of power. To address this problem, we examine the SigClust test statistic CI and find some cumbersome terms in it, such as duplicate terms and nuisance terms involving the diagonal entries. We therefore remove these useless terms and propose an improved significance test statistic, denoted BCI, which is based on the sum of squared Euclidean distances between distinct observations. Because the CI is location and rotation invariant, neither the mean nor the covariance matrix of the Gaussian distribution needs to be estimated; the only quantities that must be estimated are the eigenvalues of the covariance matrix. Under the null hypothesis H0, BCI is determined only by the largest eigenvalue and the sum of all eigenvalues. Under the alternative hypothesis H1 and certain conditions, the power converges to 1 in probability. Compared with CI, the coefficient of variation of BCI is significantly reduced, implying that the new index is more stable. Moreover, the new significance test (NewSig) maintains the size while providing greater power. Simulation experiments and two real cancer data examples illustrate the performance of the new methodology.

Chapter three: We propose Neyman's Truncation test for two-sample means of high-dimensional datasets. The t-test mentioned in chapter two is the classical method for two-sample means of univariate variables. For two-sample means of multivariate variables, the traditional choice is Hotelling's T² test. However, that test is not suitable in high-dimensional settings, mainly because it involves a singular covariance matrix and accumulated errors. From the Dempster test to the Chen-Qin and Cai-Liu tests, current test statistics fall mainly into two types, the "Sum-of-Squares" type and the "Max" type. "Sum-of-Squares" statistics perform poorly against sparse alternatives, while "Max" statistics are not powerful enough for non-sparse datasets. Inspired by Fan (1996) [21], we propose a "Max-Partial-Sum" statistic named Neyman's Truncation test, constructed from the maximum of partial sums of marginal test statistics; the "Sum-of-Squares" and "Max" statistics are the two extreme cases of this construction, so the test has great power against both dense and sparse alternatives. A data transformation procedure makes the statistic more sensitive to the vanishingly small signals of the high-dimensional setting: the transformation weakens the dependence between variables and strengthens the weak signals. Under the null hypothesis H0, the p-value can be obtained from an asymptotic double-exponential distribution, and under the alternative hypothesis H1 the test has asymptotic power 1. Because the convergence to the double-exponential distribution is slow, Bootstrap procedures are needed in practice. Simulation studies and the analysis of a leukemia dataset verify the numerical performance.
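To make the construction concrete, here is a hedged sketch of one plausible "Max-Partial-Sum" statistic (the marginal z-statistics, the standardization, and the helper name `max_partial_sum_stat` are our assumptions; the dissertation's data transformation and exact normalization are not reproduced):

```python
# One plausible "Max-Partial-Sum" statistic: square the marginal
# two-sample z-statistics, sort them in decreasing order, and take
# the maximum standardized partial sum over all truncation points k.
# k = 1 recovers a "Max"-type statistic and k = p a "Sum-of-Squares"
# type, the two extremes mentioned above.
import numpy as np

def max_partial_sum_stat(X, Y):
    """X: (n1, p) and Y: (n2, p) arrays of observations."""
    n1, n2 = X.shape[0], Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    var = X.var(axis=0, ddof=1) / n1 + Y.var(axis=0, ddof=1) / n2
    z2 = np.sort((diff / np.sqrt(var)) ** 2)[::-1]  # decreasing z_j^2
    k = np.arange(1, z2.size + 1)
    # Under H0 each z_j^2 is roughly chi-square(1): mean 1, variance 2.
    return np.max((np.cumsum(z2) - k) / np.sqrt(2 * k))
```

Because the double-exponential limit kicks in slowly, the p-value would in practice be calibrated by a Bootstrap or permutation scheme rather than read from the asymptotic distribution directly.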
Chapter four: We analyze datasets with distribution randomness. Regression analysis is one of the most widely used techniques for analyzing multivariate datasets; its broad appeal and usefulness stem from the conceptually simple process of using an equation to express the relationship between a response variable and a set of related predictor variables. At the beginning of this chapter, we analyze a 1995 dataset from the Fifth National Bank of Springfield. We regress annual salary on four predictors: job level, education level, gender, and a dummy variable equal to 1 if the employee's current job is computer related and 0 otherwise. Fitting a linear regression model by ordinary least squares, the histogram of the residuals shows large dispersion and even a small cluster. We also tried some nonlinear models, but the results showed little improvement. This leads us to ask whether there are predictor variables that were not observed or were ignored.

In fact, in regression analysis some predictors may be unobservable, unobserved, or ignored. These factors affect the response randomly, so the observed data follow a conditional distribution given these factors. We call this phenomenon distribution randomness. To analyze datasets with distribution randomness, we combine the classical linear regression model with the definition of upper expectation, propose the upper expectation linear regression model, and develop a two-step penalized maximum least squares procedure to estimate the mean function and the upper expectation of the error. Note that an important step in the estimation procedure is to select the data that can be used to estimate the upper expectation of the error, which parallels the idea in chapter three of choosing the marginal statistics that maximize the partial sums of squares. The resulting estimators are consistent and asymptotically normal under certain conditions. Simulation studies and a real data example show that classical least squares estimation fails while the new penalized maximum least squares procedure performs well.
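To fix notation, the following LaTeX block sketches one schematic formalization of the model (the family {P_θ} and the two-step outline in the comments are our assumptions, not the dissertation's exact definitions):

```latex
% Upper expectation linear regression: a schematic setup.  The error
% distribution is only known to lie in a family {P_theta}, modeling the
% "distribution randomness" caused by unobserved or ignored predictors.
\[
  Y_i = X_i^{\top}\beta + \varepsilon_i , \qquad
  \overline{\mathbb{E}}[\varepsilon]
    := \sup_{\theta \in \Theta} \mathbb{E}_{P_\theta}[\varepsilon] .
\]
% Two-step idea: (1) estimate beta by penalized least squares for the
% mean function; (2) select the observations whose residuals carry
% information about the largest mean error and estimate the upper
% expectation of the error from them.
```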
Keywords/Search Tags: High dimensionality, Clustering, Significance test, p-value, Power, Empirical size, Sparsity, Neyman's Truncation, Distribution randomness, Penalized least squares, Upper expectation