
Some Studies On Subsampling And Variable Selection In Large-scale Data

Posted on: 2022-07-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y X Han
GTID: 1488306728985269    Subject: Statistics
Abstract/Summary:
Data science is a field in continuous development. With the rapid advancement of science and technology, the falling cost of data collection has led to the emergence of large-scale data, which are not only extremely large in volume but also high-dimensional and complex in structure. Such data frequently appear in all walks of life, including biomedicine, industrial manufacturing, information technology, communication systems, earth science, finance, and artificial intelligence. Statisticians have been committed to developing statistical methodologies and algorithms that can effectively mine the information behind these data and provide theoretical guarantees. Indeed, the famous statistician R. A. Fisher pointed out in 1922 that "the aim of statistical methods is to reduce a body of data to a few quantities relevant to the objective in hand in such a way as to preserve all relevant information and exclude the irrelevant information". The fast generation, massive scale, and high dimensionality of large-scale data create a series of difficulties in storage, computation, analysis, and other operational and resource-budget issues. In general, the sheer amount of data causes unnecessary storage or computational costs and challenges transmission efficiency, while high dimensionality and complex structure lead to the failure of many classical low-dimensional methods, making it hard for traditional statistical tools to meet the requirements of data analysis in various fields. How to mine and identify the useful information hidden in large-scale data, and to carry out the corresponding statistical analysis and inference, has become a major challenge in modern statistics.

In this dissertation, we attempt to provide some new solutions to the above problems for current large-scale data, based on mainstream modern statistical analysis methods. We discuss and study the following related frontier topics: model checking for large-scale data based on structure-adaptive sampling, optimal subsampling for classification under high computational complexity, and identification of important variables in high-dimensional data. The first two problems concern using subsampling strategies to develop computationally tractable methods for testing and classification without losing much estimation efficiency when the sample size is super large. The third problem aims at constructing a data-driven variable selection procedure with effective error rate control for high-dimensional data, thereby enhancing model validity and interpretability. Following this research line, the main content of each part of the dissertation is briefly introduced below.

Chapter 1 is the introduction, consisting of the background, concepts, and notation used in the subsequent chapters. Specifically, we first introduce the background of large-scale data and the current research at home and abroad. We then introduce model checking, optimal subsampling, sufficient dimension reduction (SDR) and related methods, and error rate control in high-dimensional data. The organization of the dissertation is also described.

In Chapter 2, a design-adaptive model checking method is proposed based on an optimal subsampling strategy. When collecting a set of data, people often directly use a pre-specified model to interpret the data without asking whether the data adequately fit that model. Lack-of-fit testing, or model checking, plays a key role in addressing this issue in statistical inference. Many effective model checking methods have been designed for small or moderate sample sizes. Despite the availability of large-scale data through modern technology, the challenges of model checking are not yet well addressed when resource budgets are limited or responses are difficult to collect. Since model checking is very likely one of the most preliminary steps in data analysis, practitioners are reluctant to spend much computational effort on it. We therefore aim to answer the question: given a limited budget or limited resources, how can a practitioner optimally use that budget to implement model checking in large-scale data inference? We derive an optimal subsampling strategy to select a small subset from a large data pool. To ensure that the proposed test achieves the asymptotically best power, a two-step procedure is presented: the first step assigns a sampling probability to each observation through a pilot study by maximizing the asymptotic power; the second step determines the subsample size from the available resources and constructs the test statistic using the selected subset. Another point worth emphasizing is that, because we provide a general model checking framework covering, for example, the linear model, single-index model, additive model, and varying coefficient model, the "curse of dimensionality" in multivariate nonparametric estimation is inevitable. The sufficient dimension reduction technique is therefore employed to fully utilize the information contained in the specified model. The theoretical guarantees are given in two parts, for a user-specified reduction direction and for an estimated reduction direction, respectively. Numerical results on simulated examples and a real-world dataset demonstrate satisfactory type I error control and higher power, with significant savings in computation and storage.
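To make the two-step procedure of Chapter 2 concrete, the following minimal Python sketch illustrates the general workflow under stated assumptions: a pilot fit assigns each observation a sampling probability (here simply proportional to the absolute pilot residual, a placeholder for the power-maximizing probabilities derived in the dissertation), and the budgeted subsample is then used to form an inverse-probability-weighted, kernel-smoothed lack-of-fit statistic along a dimension-reduction direction. The function names and the particular statistic are illustrative, not the dissertation's exact construction.

```python
import numpy as np

# Hypothetical sketch of a two-step subsampling lack-of-fit check.
# `fit` is any routine that takes (X, y) and returns a prediction function
# for the working (null) model, e.g. a least-squares linear fit.

def pilot_probabilities(X, y, fit, pilot_size=500, seed=0):
    """Step 1: pilot study; sampling probabilities proportional to the
    absolute residuals of a pilot fit (illustrative choice only)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(pilot_size, len(y)), replace=False)
    predict = fit(X[idx], y[idx])                       # pilot fit of the working model
    resid = np.abs(y - predict(X)) + 1e-8               # residuals over the full data pool
    return resid / resid.sum()

def subsample_check(X, y, fit, probs, direction, budget=1000, h=0.5, seed=1):
    """Step 2: draw the budgeted subsample, reweight by 1/(budget * prob),
    and compute a kernel-smoothed residual statistic on the projected covariate."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=budget, replace=True, p=probs)
    w = 1.0 / (budget * probs[idx])                     # inverse-probability weights
    resid = y[idx] - fit(X[idx], y[idx])(X[idx])        # residuals of the null fit on the subsample
    u = X[idx] @ direction                              # SDR projection against the curse of dimensionality
    K = np.exp(-0.5 * ((u[:, None] - u[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)
    wr = w * resid
    return (wr @ K @ wr) / (budget * (budget - 1) * h)  # Zheng-type lack-of-fit statistic

# Example with a linear working model (assumed data X, y and direction b):
# fit = lambda Xs, ys: (lambda Xn: Xn @ np.linalg.lstsq(Xs, ys, rcond=None)[0])
# T = subsample_check(X, y, fit, pilot_probabilities(X, y, fit), b)
```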
Chapter 3 focuses on the classification task with optimal subsampling in large-scale data. Classification is a central topic in statistical analysis and machine learning, and the support vector machine (SVM) stands out among classification algorithms for its high accuracy, flexibility, and robustness. However, its intensive computation has hindered its application to large-scale data. Various modern methods for processing large-scale data have been developed, such as online updating, divide-and-conquer, and subsampling. By the geometric interpretation of the SVM, only a small subset of observations, called support vectors, affects the location of the separating hyperplane, which naturally suggests reducing the computation of the SVM through subsampling. Inspired by leverage score sampling for regression and matrix approximation problems, we propose a new binary classifier for the linear nonseparable SVM. Our classifier selects an informative subset from the training sample to greatly reduce the data size and thus achieve efficient computation. We take a novel view of the SVM under a general subsampling framework and rigorously establish the asymptotic normality of the classification hyperplane parameters. The optimal subsampling probability is derived by minimizing the asymptotic variance under certain optimality criteria. The theoretical difficulty is that we account for two sources of randomness, from the training data itself and from the subsampling process, which distinguishes our method from most existing optimal subsampling methods that condition on the full dataset. A novel two-step optimal subsampling algorithm is proposed, consisting of a pilot study to estimate the optimal subsampling probability and a subsampling step to construct the classifier. We compare our method with existing competitors in terms of estimation, prediction, and computation, and its advantage is demonstrated in several numerical studies.
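The two-step classifier of Chapter 3 follows the same pilot-then-subsample pattern. The sketch below is a rough, hypothetical illustration assuming a weighted linear SVM fitted on the selected subset with scikit-learn; the pilot probabilities (larger for observations near the pilot hyperplane, i.e. likely support vectors) stand in for the variance-minimizing probabilities derived in the dissertation.

```python
import numpy as np
from sklearn.svm import SVC

def pilot_svm_probabilities(X, y, pilot_size=500, seed=0):
    """Step 1: fit a pilot linear SVM on a small uniform subsample and
    give higher sampling probability to points close to its hyperplane."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(pilot_size, len(y)), replace=False)
    pilot = SVC(kernel="linear", C=1.0).fit(X[idx], y[idx])
    closeness = 1.0 / (np.abs(pilot.decision_function(X)) + 1e-3)  # near-margin points score high
    return closeness / closeness.sum()

def subsample_svm(X, y, probs, budget=2000, seed=1):
    """Step 2: draw the budgeted subsample and fit a weighted linear SVM,
    with inverse-probability weights correcting the sampling bias."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=budget, replace=True, p=probs)
    weights = 1.0 / (budget * probs[idx])
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X[idx], y[idx], sample_weight=weights)
    return clf          # clf.coef_ and clf.intercept_ give the separating hyperplane

# Example (assumed binary labels y):
# clf = subsample_svm(X, y, pilot_svm_probabilities(X, y))
# predictions = clf.predict(X_new)
```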
Chapter 4 develops a new model-free variable selection method with error rate control for high-dimensional data. With the rapid development of science and technology, various types of high-dimensional data frequently appear in fields such as genetics and finance. Sufficient dimension reduction (SDR) is a powerful technique for extracting relevant information from high-dimensional data. However, while it captures the important features or patterns in the data, the reduction subspace usually involves all of the original variables, which makes it hard to interpret in practice. Researchers have developed plenty of variable selection methods suited to the various complex structures of high-dimensional data, but existing methods only provide consistent variable selection results; they do not offer global error rate control of false discoveries against underestimation or overestimation. How to identify the truly contributing variables in high-dimensional data while controlling the false discovery rate (FDR) has therefore become an important statistical problem. We propose a novel model-free, data-driven variable selection procedure within the sufficient dimension reduction framework for a family of inverse regression methods, via a data splitting technique, under both low-dimensional and high-dimensional settings. We first prove that the model-free SDR problem is equivalent to making inferences on the regression coefficients in a set of linear regressions with several response transformations. A variable selection procedure is then developed with error rate control by constructing marginal symmetric statistics and a data-driven threshold; a schematic sketch of this data-splitting step is given at the end of this summary. When the dimension is large, model formulation and validation would be difficult or even infeasible. Benefiting from the symmetry of the ranking statistics produced by data splitting, our methodology is distinguished from most existing variable selection methods: the asymptotic distribution of the test statistics is not required, and the false discovery rate is controlled in a purely data-driven way. The method achieves finite-sample and asymptotic FDR control under mild theoretical conditions. The effectiveness of the proposed selection procedure is demonstrated on a variety of numerical experiments and a real cancer gene dataset.

Chapter 5 concludes the dissertation with a summary and some potential directions for future research. Although this dissertation addresses only a small area of the many frontier statistical problems, the proposed testing and classification methods based on optimal subsampling in large-scale data are highly scalable. In general, they can readily be extended to many other problems involving subsampling techniques in modern large-scale data analysis, as long as design-point sampling is allowed in the process, although the specific problems merit further investigation. The data-driven variable selection procedure with error rate control in high-dimensional data can also be extended to further settings, such as factor models, low-rank matrix estimation, and other fields.
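As a companion to Chapter 4, the following sketch shows, in generic form, how data splitting, symmetric ranking statistics, and a data-driven threshold can control the FDR. The mirror statistic used here (built from coefficient estimates on the two halves of the data) and the simple marginal fits are common illustrative choices and are not necessarily the exact construction in the dissertation.

```python
import numpy as np

def mirror_statistics(X, y, seed=0):
    """Split the sample into two halves, fit a marginal regression for each
    variable on each half, and combine them into a symmetric statistic:
    true signals tend to be large and positive, nulls symmetric about zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    perm = rng.permutation(n)
    half1, half2 = perm[: n // 2], perm[n // 2:]
    beta1 = np.array([np.polyfit(X[half1, j], y[half1], 1)[0] for j in range(p)])
    beta2 = np.array([np.polyfit(X[half2, j], y[half2], 1)[0] for j in range(p)])
    return np.sign(beta1 * beta2) * (np.abs(beta1) + np.abs(beta2))

def select_with_fdr(M, q=0.1):
    """Data-driven threshold: the smallest t whose estimated false discovery
    proportion, using the left tail to count false discoveries, is below q."""
    for t in np.sort(np.abs(M)):
        fdp_hat = (np.sum(M <= -t) + 1) / max(np.sum(M >= t), 1)
        if fdp_hat <= q:
            return np.where(M >= t)[0]   # indices of the selected variables
    return np.array([], dtype=int)

# Example: selected = select_with_fdr(mirror_statistics(X, y), q=0.1)
```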
Keywords/Search Tags:Classification, Data-driven, Data splitting, FDR control, Large-scale data, Model-free, Multiple testing, Optimal subsampling, Sufficient dimension reduction, Support vector machine