
Dimensionality Reduction Techniques And Model Optimization Methods In Big Data Analytics

Posted on: 2024-05-20  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S S Chen  Full Text: PDF
GTID: 1520307202954639  Subject: Probability theory and mathematical statistics
Abstract/Summary:
With advances in technology, the ability to collect and store data has improved greatly, and high-dimensional, even ultrahigh-dimensional, data are increasingly common across research fields. At the same time, the rapid growth in the number of variables brings with it a large number of irrelevant variables. Sifting useless information out of thousands or even millions of variables, so that the important variables are selected while the useful information is preserved, is therefore a meaningful and important task.

In ultrahigh-dimensional models the number of variables is generally much larger than the sample size, which greatly reduces the effectiveness of many classical methods. For example, least squares estimation requires inverting the sample covariance matrix, which in turn requires the number of variables to be smaller than the sample size; when the dimension exceeds the sample size, the sample covariance matrix becomes singular and the least squares solution cannot be obtained. Similarly, in maximum likelihood estimation the number of likelihood equations is then far smaller than the number of unknown parameters, making the model parameters unidentifiable. Moreover, ultrahigh-dimensional data often contain many irrelevant variables, and including all of them in a statistical model can introduce significant estimation bias. It is therefore necessary to select the "important" predictors quickly and effectively and reduce the data dimension.

When the model is known, its structure can be used to screen predictors; in practice, however, the data-generating model is often unknown, so model-free screening indicators are required. This thesis proposes a new screening indicator, based on properties of quantiles, that assesses the independence between
variables X and Y at a quantile level τ. The information from a single quantile level, however, may not fully capture the dependence between variables. To assess independence comprehensively and accurately, the thesis introduces three aggregation methods over quantile levels: the maximum, the average, and the weighted average, and analyzes the advantages and limitations of each. The quantile-based indicators have the following advantages: (1) they require no model assumptions and are highly flexible; (2) they are insensitive to heavy-tailed and heterogeneous data, and hence strongly robust; (3) their computational complexity is low, typically only O(n).

Building on these indicators, the thesis then proposes a feature screening method for ultrahigh-dimensional data, Quantile-Composited feature Screening (QCS), whose screening indicator uses the weighted-average aggregation. First, the dependence indicator between each predictor and the response is computed; then an important predictor subset is selected by thresholding. The thesis establishes the relevant properties of QCS and gives theoretical proofs that guarantee its screening effectiveness. The method is further validated through extensive Monte Carlo simulations and real-data analyses, whose results show that QCS achieves superior screening performance compared with several existing classical methods. A comparison of screening speed shows that the computation time of QCS grows linearly while that of the competing methods grows quadratically, so QCS combines high screening efficiency with fast computation.

Once the "important" variables have been selected, the next logical step is to build a model. While conventional parametric
models have good regression properties, they often overlook dynamic characteristics in the data, which are quite common in practical applications. To adapt to such dynamics and improve the model's fit, the parameters must be allowed to change from fixed values to functions of a covariate, which leads to varying coefficient models. Varying coefficient models have advantages that other models cannot replace: they retain the interpretability of parametric models while offering great flexibility and adaptability.

Model estimation is usually based on an ideal unbiasedness assumption: all relevant predictors are collected and the conditional expectation of the error term is zero, under which consistent parameter estimates can be obtained. In practice this assumption rarely holds. First, sparsity of the predictors may not hold strictly, and "unimportant" variables may still contribute to the model; absorbing them into the random error induces estimation bias. Second, one cannot determine with certainty which predictors are correlated with the response, especially when the predictor dimension is high; the accumulated white-noise effect of many "unimportant" predictors can then cause some "important" predictors to be omitted, which no existing screening technique can completely avoid. Moreover, correlation between "important" and "unimportant" predictors can make the conditional expectation of the error nonzero. To address these issues, the thesis introduces an artificial variable into the varying coefficient model, optimizing the model so that unbiasedness is restored. Under certain conditions the model retains a
linear form even after the new variable is introduced. The thesis rigorously proves the unbiasedness of the new model and provides estimation methods for its coefficients. Extensive simulation experiments validate the method: when "important" variables are omitted or the predictors are correlated, the proposed method outperforms existing methods in both estimation accuracy and prediction accuracy, and even when no important variable is omitted and the predictors are nearly uncorrelated, its performance is comparable to that of least squares estimation.

Bayesian inference is a statistical inference framework based on Bayes' theorem. It provides a flexible, probabilistic framework for modeling and inference that is highly effective for uncertainty and complex problems. In Bayesian inference, model parameters are treated as random variables whose uncertainty is represented by probability distributions; combining the prior distribution with the observed data yields the posterior distribution of the parameters, and hence more accurate parameter estimates. Chapter 5 focuses on the application of Bayesian methods to parameter updating and iteration. Drawing on Bayesian theory, it derives the conditional posterior distributions of the parameters and provides two estimation approaches: sampling methods based on the conditional posterior distributions, and estimation methods based on the marginal and joint posterior distributions. These approaches yield parameter estimates together with a Bayesian characterization of their uncertainty.
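The screening step described above can be sketched in code. The sketch below is illustrative, not the thesis's exact definitions: the specific quantile-based indicator (a normalized covariance between a quantile-exceedance indicator and the predictor), the quantile grid, the equal weights, and the conventional screening size n/log(n) are all assumptions made for the example.

```python
import numpy as np

def quantile_indicator(x, y, tau):
    """Quantile-based dependence indicator between x and y at level tau.
    Under independence, cov(1{y <= Q_tau(y)}, x) is zero; its normalized
    magnitude measures dependence at that quantile. Cost is O(n) per pair."""
    q = np.quantile(y, tau)
    ind = (y <= q).astype(float)
    num = np.mean((ind - ind.mean()) * (x - x.mean()))
    den = np.sqrt(tau * (1 - tau) * x.var())
    return 0.0 if den == 0 else abs(num / den)

def qcs_screen(X, y, taus=(0.25, 0.5, 0.75), weights=None, top_k=None):
    """Quantile-composited screening: weighted average of the indicator
    over several quantile levels, then selection by rank thresholding."""
    n, p = X.shape
    if weights is None:
        weights = np.full(len(taus), 1.0 / len(taus))  # equal weights (assumed)
    scores = np.array([
        sum(w * quantile_indicator(X[:, j], y, t) for t, w in zip(taus, weights))
        for j in range(p)
    ])
    if top_k is None:
        top_k = int(n / np.log(n))                     # common screening-size convention
    selected = np.argsort(scores)[::-1][:top_k]
    return np.sort(selected), scores
```

Because each indicator is a single pass over the data, the total cost is linear in both n and p, which matches the linear-growth behavior reported for QCS.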
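A varying coefficient model lets each regression coefficient be a function of a covariate u rather than a constant. The following is a minimal sketch of a standard kernel-weighted least squares estimator for such a model; it illustrates the model class only, not the thesis's artificial-variable correction, and the Gaussian kernel and bandwidth are illustrative choices.

```python
import numpy as np

def vc_fit_at(u0, U, X, y, h=0.1):
    """Kernel-weighted least squares estimate of beta(u0) in the
    varying coefficient model  y_i = x_i' beta(u_i) + e_i.
    Observations with u_i near u0 get large weight, so the local fit
    approximates the coefficient functions at u0."""
    w = np.exp(-0.5 * ((U - u0) / h) ** 2)   # Gaussian kernel weights around u0
    Xw = X * w[:, None]
    # weighted normal equations: (X' W X) beta = X' W y
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```

Repeating the fit over a grid of u0 values traces out the full coefficient functions, which is what gives the model its flexibility while keeping a parametric interpretation at each point.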
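Sampling from conditional posterior distributions, the first of the two Bayesian approaches mentioned above, can be illustrated with a textbook Gibbs sampler. The normal model with conjugate priors below is a generic example chosen for the sketch, not the model treated in Chapter 5.

```python
import numpy as np

def gibbs_normal(y, n_iter=2000, mu0=0.0, tau2=100.0, a=2.0, b=2.0, seed=0):
    """Gibbs sampler for y_i ~ N(mu, sigma2) with conjugate priors
    mu ~ N(mu0, tau2) and sigma2 ~ Inv-Gamma(a, b).  Each iteration draws
    from the two closed-form conditional posteriors in turn."""
    rng = np.random.default_rng(seed)
    n, ybar = len(y), float(np.mean(y))
    mu, sig2 = ybar, float(np.var(y))
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # mu | sigma2, y ~ Normal: precision-weighted blend of prior and data
        prec = n / sig2 + 1.0 / tau2
        mean = (n * ybar / sig2 + mu0 / tau2) / prec
        mu = rng.normal(mean, np.sqrt(1.0 / prec))
        # sigma2 | mu, y ~ Inv-Gamma(a + n/2, b + SS/2)
        ss = np.sum((y - mu) ** 2)
        sig2 = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + ss / 2.0))
        draws[t] = mu, sig2
    return draws
```

The retained draws approximate the joint posterior, so posterior means give point estimates and the spread of the draws characterizes parameter uncertainty, exactly the two outputs the Bayesian approach is meant to provide.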
Keywords/Search Tags: Big data, Feature screening, Quantile estimation, Variable coefficient model, Bayesian inference