
Distributed Sparse Quantile Regression By Feature Space Partitioning

Posted on: 2020-06-19    Degree: Master    Type: Thesis
Country: China    Candidate: J H Liu    Full Text: PDF
GTID: 2370330590495170    Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the advent of the era of big data, massive sample sizes have brought more convenience to our daily life. For instance, the price of whole genome sequencing has fallen dramatically in genomics. The same is true in other areas such as the analysis of surveillance videos, biomedical imaging, retail sales, social media analysis and high-frequency finance. The current trend for data to be produced and stored on an ever larger and cheaper scale is likely to be maintained or even accelerated in the future. This trend will make a big difference to business, engineering and science. For instance, scientific progress is becoming increasingly data-driven, and researchers will increasingly regard themselves as consumers of data. Therefore, effective statistical analysis of big data is becoming more and more important.

Big data has many applications in other disciplines, such as finance, economics, genomics and neuroscience. For instance, huge amounts of social network data are generated by WeChat and Weibo every day. These data reveal the personal characteristics of many people and allow those characteristics to be exploited in various fields. For instance, some researchers use such data to predict stock market movements, epidemic outbreaks and movie box office revenues. Furthermore, the Internet and social media contain a wealth of information about consumer preferences, which can shed light on the business cycle, economic indicators, socioeconomic status and political attitudes. Data from social networks are bound to keep growing explosively and to become available for more new applications. Several other applications that are becoming possible in the big data era include personalized services, Internet security, personalized medicine and the digital humanities.

However, big data also poses plenty of new challenges to data scientists while bringing benefits to our life. So what are the challenges of big data analysis? Big data are characterized by large sample sizes and high dimensionality, and these two characteristics bring three distinctive challenges: (1) high dimensionality gives rise to incidental endogeneity, spurious correlations and noise accumulation; (2) the combination of large sample size and high dimensionality leads to problems such as unstable algorithms and heavy computational cost; (3) the large samples in big data are usually collected from many sources, with different techniques, at different time points.

In order to deal better with the challenges posed by big data, we need to explore new computational methods and new statistical thinking. For instance, many traditional methods perform well for moderate sample sizes but cannot scale to massive data. Likewise, plenty of statistical methods that work well for low-dimensional problems often fail in the analysis of high-dimensional data. In order to design statistical procedures that can effectively predict and explore big data, we need to strike a balance between computational efficiency and statistical accuracy. On the statistical side, variable selection and dimension reduction play crucial roles in the analysis of high-dimensional data. On the computational side, big data gives impetus to the development of new data storage methods and computing infrastructure. Optimization is only a tool for big data analysis, not its purpose. This change of paradigm has led to significant progress on fast algorithms that extend to high-dimensional, large-scale data analysis, and it has led to mutual promotion between different fields, including applied mathematics, optimization and statistics.

Computational complexity, model interpretability and statistical accuracy are the three major factors in statistical analysis. In traditional studies, the number of feature variables p is far smaller than the number of observations n. Under these circumstances, none of the three factors has to be sacrificed for the others. Nevertheless, conventional methods run into many problems when the sample size n is far smaller than, or comparable to, the feature dimension p. These problems include how to balance the stability and computational efficiency of statistical procedures, how to interpret the estimated models, how to establish non-asymptotic or asymptotic theory, and how to design statistically more efficient procedures.

Moreover, with the rapid development of modern science and technology, data sets have become enormous, which raises a further, more practical problem: how should we store and process the data when the sample size n or the feature dimension p greatly exceeds the storage limit of an ordinary machine? This problem has not only attracted the attention of computational scientists over the past decade but has also become an interview question at many high-tech companies. However, it is purely a computational problem for massive data sets and does not by itself involve any statistical modelling.

In the analysis of high-dimensional data, the sparsity principle states that only a few factors have an effect on the response; this principle is widely adopted and considered reasonable. Variable selection problems in ultra-high dimensional feature spaces arise more and more often in big data analysis, so new statistical theories and methods are urgently needed. For instance, when studying the interactions between proteins, the sample size is only of the order of a few thousand while the feature space is of the order of several million; when using microarray gene data for disease classification, the number of arrays is generally of the order of tens of thousands while the number of gene expression profiles is even larger; and when studying the genetic relationship between phenotypes and genotypes, the two are of roughly the same order of magnitude. In these cases, we need to identify the significant features that contribute to the response and to accurately predict the response after certain clinical interventions. Existing variable selection techniques can be extended to ultra-high dimensional spaces through a series of transformations. The assumption that makes high-dimensional statistical inference possible is that the regression function lies in a low-dimensional manifold: the parameters of the p-dimensional regression are assumed to be sparse, most of them being zero, with the remaining non-zero components corresponding to the effective feature variables. Under the sparsity assumption, the effective factors can be selected through feature variable selection, which improves both the accuracy of estimation and the interpretability of the big data model; when the model is very sparse, feature variable selection can also greatly reduce the computational cost.

In the high-dimensional case, however, the Lasso method encounters problems of computational time and computational complexity when solving the linear regression problem.
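For reference, the Lasso estimator referred to throughout is the standard l1-penalized least squares criterion for the sparse linear model (a textbook formulation, not quoted verbatim from the thesis):

    y = X\beta^{*} + \varepsilon, \qquad \beta^{*} \in \mathbb{R}^{p} \ \text{with only a few non-zero coordinates},

    \hat{\beta}_{\mathrm{lasso}} \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p}} \ \frac{1}{2n}\,\lVert y - X\beta \rVert_{2}^{2} \;+\; \lambda\,\lVert \beta \rVert_{1},

where the tuning parameter \lambda > 0 controls how many coordinates of the estimate are set exactly to zero.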
In Chapter 2, several optimization algorithms for solving the Lasso problem are introduced. The gradient descent method is a first-order algorithm that uses only local information at each iteration, but it typically requires a very large number of iterations. By adding smoothing and dual techniques, one obtains an optimal dual-cone algorithm that is more efficient and more stable than plain gradient descent. The alternating direction method of multipliers (ADMM) exploits the advantages of distributed convex optimization and alternately performs ridge-regression-type updates on the Lasso problem, which accelerates convergence on massive data sets. The coordinate descent method selects the steepest descent coordinate and minimizes the corresponding one-dimensional sub-function of the objective, which reduces both the computational time and the storage requirement.
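To make the coordinate descent idea concrete, here is a minimal sketch, not taken from the thesis, of cyclic coordinate descent with soft-thresholding for the Lasso criterion stated above; the function names and the fixed number of sweeps are illustrative choices only.

    import numpy as np

    def soft_threshold(z, gamma):
        # Soft-thresholding operator: closed-form solution of the one-dimensional Lasso step.
        return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
        # Cyclic coordinate descent for (1/2n)||y - X b||_2^2 + lam * ||b||_1.
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0) / n + 1e-12   # per-coordinate curvature, guarded against zero columns
        resid = y - X @ beta                        # running residual y - X beta
        for _ in range(n_sweeps):
            for j in range(p):
                # (1/n) * x_j' * (residual with coordinate j added back in)
                rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
                new_bj = soft_threshold(rho, lam) / col_sq[j]
                resid += X[:, j] * (beta[j] - new_bj)   # keep the residual consistent with the update
                beta[j] = new_bj
        return beta

Each coordinate update has a closed form thanks to soft-thresholding, which is what makes coordinate descent attractive for large, sparse problems.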
However, such traditional optimization algorithms cannot effectively handle regression and classification problems on big data under memory constraints. We therefore introduce the median selection subset aggregation estimator (message) algorithm based on sample space partitioning, which can process and analyze data effectively in linear regression models when the sample size is much larger than what a single machine can store. For the situation in which the feature dimension is much larger than the memory of a single machine, we first introduce the Bayesian split-and-merge algorithm, which, however, cannot guarantee the efficiency of screening. We then introduce a parallel feature selection algorithm based on group testing, which relies too heavily on the assumption of independence between groups of features. Finally, we introduce the decorrelated feature space partitioning (DECO) algorithm, which first decorrelates the subset data and then partitions the feature space to select variables.

The Lasso method is not robust when the error term is heavy-tailed in the general linear regression model, whereas quantile regression is unaffected by the distributional form of the error term and yields a more robust regression model. We introduce the classical quantile regression model and the generic divide-and-conquer algorithm, which is statistically insufficient for massive data sets. We then introduce a linear estimator for quantile regression under memory constraints: it first partitions the sample space and smooths the regression with a kernel function, then transforms the problem into a quadratic one and obtains an L1-regularized estimator with a penalty term. This algorithm combines the Lasso and quantile regression effectively to solve quantile regression models on massive data sets.

Inspired by the message algorithm and the DECO algorithm, we propose a sample space and feature space partitioning algorithm for quantile regression, designed for the situation in which both the sample size and the feature dimension are much larger than the memory of a single machine. By combining the advantages of sample space partitioning and feature space partitioning, the algorithm performs variable selection and estimation efficiently for quantile regression.
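The following is only a schematic sketch of this divide-and-conquer structure, not the algorithm studied in the thesis: it splits rows (sample space) and columns (feature space), fits an L1-penalized quantile regression on each block with a simple proximal subgradient loop, and combines the sample-block estimates by a coordinate-wise median. All function names, step sizes and iteration counts are hypothetical choices.

    import numpy as np

    def soft_threshold(z, gamma):
        return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

    def fit_l1_quantile(X, y, tau, lam, lr=0.05, n_iter=500):
        # Proximal subgradient loop for (1/n) * sum of check losses + lam * ||beta||_1.
        n = X.shape[0]
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            resid = y - X @ beta
            g = np.where(resid > 0, tau, tau - 1.0)   # subgradient of the check (pinball) loss
            beta = beta + lr * (X.T @ g) / n          # subgradient step on the quantile loss
            beta = soft_threshold(beta, lr * lam)     # proximal step for the L1 penalty
        return beta

    def partitioned_quantile_lasso(X, y, tau, lam, n_row_blocks=4, n_col_blocks=4, seed=0):
        # Schematic divide-and-conquer: partition rows (samples) and columns (features),
        # fit each block separately, then take a coordinate-wise median over the row blocks.
        n, p = X.shape
        rng = np.random.default_rng(seed)
        row_blocks = np.array_split(rng.permutation(n), n_row_blocks)
        col_blocks = np.array_split(np.arange(p), n_col_blocks)
        beta_hat = np.zeros(p)
        for cols in col_blocks:
            block_fits = [fit_l1_quantile(X[np.ix_(rows, cols)], y[rows], tau, lam)
                          for rows in row_blocks]
            beta_hat[cols] = np.median(np.vstack(block_fits), axis=0)
        return beta_hat

In particular, the sketch skips the decorrelation and kernel-smoothing steps described above, which in the full procedure are applied before the feature space is partitioned and before each block is fitted.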
We then carry out simulation experiments comparing our algorithm with the Lasso method applied to the full data. The results show that our algorithm is more efficient than the full-data Lasso when the number of partitions is chosen suitably and the correlation among the features is low, and that our method remains robust when the error term is slightly heavy-tailed. However, when the correlation is very high, our method is not very effective because the decorrelation step is relatively weak. Moreover, although our method is statistically efficient, it costs more computational time than the Lasso on the full data, mainly because the decorrelation step is time-consuming; this cost could be reduced by a better optimization algorithm. Finally, on a real data example concerning regression of the superconducting critical temperature, our method performs as well as the XGBoost algorithm.

In practice, a large number of data sets contain heavy-tailed error terms, and our algorithm does not handle strongly heavy-tailed situations well. We therefore need to explore more robust methods based on sample space partitioning and feature space partitioning to adapt to the variety of big data.
Keywords/Search Tags: quantile regression, sample space partitioning, feature space partitioning