Selection Of Subsampling Method Based On Linear Model

Posted on: 2024-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Wang
Full Text: PDF
GTID: 2557307112489534
Subject: Statistics
Abstract/Summary:
Big data is used more and more widely in daily life. Its defining characteristics are a large number of samples and high dimensionality, and these characteristics pose unprecedented challenges to the research and development of statistics. For example, high dimensionality can produce spurious correlation, which can lead to false scientific findings and erroneous statistical inferences. Processing data with a large sample size demands substantial computing resources, and ordinary computers are often unable to store and analyze such data. Although methods for handling big data exist, they are often too time-consuming to be practical, so a more efficient approach is needed. Taking linear models as an example, this thesis considers subsampling to reduce the amount of data, that is, extracting a representative subdata set from the complete data set for subsequent analysis. Four subsampling methods are studied: uniform random sampling (UNI), orthogonal array based subsampling (OSS), uniform projection design based subsampling (UPS), and leverage score based subsampling (LEVSS). To address the spurious correlation that high dimensionality may cause, the thesis performs linear model selection on the selected subdata sets using three methods: optimal subset selection based on the Bayesian information criterion (BIC), and two shrinkage estimation methods, LASSO and SCAD. In the numerical simulations, data sets are drawn from different distributions, and three criteria are used to evaluate the performance of the four subsampling methods: the probability of selecting the true model, the probability of selecting a model that contains all variables of the true model, and the mean squared error of parameter estimation. The simulation results show that the UPS and LEVSS methods perform better than the other two methods, and that the UPS method outperforms the LEVSS method in most cases.
Keywords/Search Tags:Big data sampling, Shrinkage estimation, Linear model selection, Uniform projection design
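
The pipeline described in the abstract (subsample, select a linear model on the subdata, then score the result on the three criteria) can be illustrated with a minimal NumPy sketch. It covers only UNI and one common form of leverage-score subsampling, in which rows are drawn with probability proportional to their leverage; whether the thesis's LEVSS method uses this exact scheme is an assumption, and OSS and UPS are not shown. The function names, the Gaussian data generator, and the simulation settings are all hypothetical and chosen for illustration only.

```python
import itertools
import numpy as np

def leverage_scores(X):
    # Diagonal of the hat matrix X (X^T X)^{-1} X^T, via a thin QR factorization.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def uniform_subsample(X, k, rng):
    # UNI: every row is equally likely to be chosen.
    return rng.choice(X.shape[0], size=k, replace=False)

def leverage_subsample(X, k, rng):
    # Assumed form of leverage sampling: probability proportional to leverage.
    h = leverage_scores(X)
    return rng.choice(X.shape[0], size=k, replace=False, p=h / h.sum())

def bic_best_subset(X, y):
    # Exhaustive search over non-empty subsets; each fit includes an intercept.
    n, p = X.shape
    best_bic, best_cols, best_beta = np.inf, (), np.zeros(p)
    for size in range(1, p + 1):
        for cols in itertools.combinations(range(p), size):
            Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ coef) ** 2)
            bic = n * np.log(rss / n) + (size + 1) * np.log(n)
            if bic < best_bic:
                beta = np.zeros(p)
                beta[list(cols)] = coef[1:]  # slopes only; unselected stay 0
                best_bic, best_cols, best_beta = bic, cols, beta
    return best_cols, best_beta

def evaluate(sampler, reps, n, k, beta, rng):
    # Monte Carlo estimate of the three criteria named in the abstract.
    true_set = set(np.flatnonzero(beta).tolist())
    exact = contains = 0
    sq_err = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, beta.size))
        y = X @ beta + rng.standard_normal(n)
        idx = sampler(X, k, rng)
        cols, est = bic_best_subset(X[idx], y[idx])
        exact += set(cols) == true_set
        contains += true_set <= set(cols)
        sq_err += np.sum((est - beta) ** 2)
    return {"p_true": exact / reps,
            "p_contains": contains / reps,
            "mse": sq_err / reps}
```

Calling, for example, `evaluate(uniform_subsample, reps=100, n=5000, k=200, beta=np.array([1.0, 0.5, 0.0, 0.0, 0.0]), rng=np.random.default_rng(1))` and the same with `leverage_subsample` gives a side-by-side comparison in the spirit of the thesis's simulations. The exhaustive subset search is only feasible for a small number of predictors.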