
Research On Statistical Methods And Their Application To Complicated Data

Posted on: 2021-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Yan
Full Text: PDF
GTID: 1360330626955681
Subject: Mathematics
Abstract/Summary:
Statistical methods are widely used in research on nature, the economy, society, science, and technology. As effective tools for data analysis, they not only extract useful information and uncover underlying regularities, but also provide a scientific theoretical basis for conclusions. As statistical applications advance, we face increasingly complex and diverse data types, and traditional methods are increasingly challenged. In this thesis, we further study statistical methods for several types of high-dimensional complex data and apply them to practical problems. The main research contents are as follows:

(1) We study how to determine the number of factors in the matrix decomposition of high-dimensional non-negative data. Since non-negative matrix factorization was introduced, it has been studied extensively, and successful applications rely heavily on correctly specifying the number of factors. However, a fully data-driven method for determining this number is not yet available in the literature. In chapter two, we propose a fully data-driven method for determining the number of factors, based on a two-step delete-one (leave-one-out) approach and called twice cross-validation (TCV). The method applies CV first to the observations and then to the variables within the observations. Compared with existing information criteria and panel criteria for specifying the number of factors, our method is not affected by tuning-parameter choices and is simple to compute. Therefore, the TCV method is also effective for complex non-negative matrix factorization models. Simulation experiments show that the proposed method selects an appropriate number of factors in many cases. Finally, we apply the TCV method to the source analysis of air pollution in Singapore. All of the selected factors have reasonable interpretations.

(2) We investigate dimension reduction for quantile regression models with censored data. We consider the complex case in which both the dependent variable and the censoring variable follow a multi-index structure in the covariates. In chapter three, we first study sufficient dimension reduction (SDR) for the survival time and the censoring time simultaneously. To estimate the SDR spaces of the dependent variable and the censoring variable, as well as their joint SDR space, we propose a new estimation method based on iterative and structure-adaptive methods; the structural dimensions of the individual SDR spaces are determined by cross-validation. The asymptotic properties are also derived. In simulations, we compare the estimation efficiency of our method with that of parametric models such as the Cox proportional hazards model. The results show that when the model is correctly specified, the two methods are equally efficient; otherwise, our method outperforms the parametric one. Applied to the well-known primary biliary cirrhosis data, the new method not only identifies an important predictor of patients' survival time, but also flags ascites. Practice shows that ascites is indeed an important indicator of late-stage primary biliary cirrhosis. However, this association had not been discovered in previous studies.

(3) We study the problem of measuring and testing independence in time series data. Non-linear time series have attracted extensive attention from scholars. In non-linear cases, tests of serial dependence based on autocorrelation coefficients are generally unsatisfactory. In chapter four, we extend to time series data a novel nonparametric testing approach for measuring the independence of two random variables. The new time series independence test statistic is called the composite coefficient of determination. It takes values between 0 and 1, and it equals 0 if and only if the series is independent. Because the test is distribution-free and invariant under any monotonic transformation of the data, it is robust to heavy-tailed distributions and outliers, which is extremely important for financial data. Because tests at two different lag orders may give opposite conclusions, in addition to studying individual tests at different lags, we also discuss the corresponding portmanteau tests, which use multiple lags. Extensive experiments show that, across different lag orders, our test and its corresponding portmanteau test maintain a reasonable level under independent samples and show higher power under dependent samples. Finally, our method is applied to the S&P 500 index to test the random walk hypothesis for the stock price and the i.i.d. hypothesis for standardized model residuals, respectively.
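
As a rough illustration of the portmanteau idea in part (3), the following Python sketch sums a rank-based dependence measure over several lags and calibrates the sum by permutation. It is not the composite coefficient of determination from the thesis: the squared Spearman rank correlation used here is only a stand-in that shares the invariance under monotonic transformations, and it does not have the property of being 0 if and only if the series is independent. The function names (`ranks`, `lag_rank_r2`, `portmanteau`, `permutation_pvalue`) and the permutation calibration are assumptions made for illustration.

```python
import random


def ranks(values):
    """Return 1-based ranks; ties are broken by position (fine for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def lag_rank_r2(series, lag):
    """Stand-in statistic: squared Spearman rank correlation of x[t] and x[t+lag].

    Requires lag >= 1 and at least two overlapping pairs. The value lies in [0, 1]
    and is unchanged by any strictly increasing transformation of the data.
    """
    rx = ranks(series[:-lag])
    ry = ranks(series[lag:])
    n = len(rx)
    mean = (n + 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for rx and ry: both permute 1..n
    return (cov / var) ** 2


def portmanteau(series, max_lag):
    """Combine the individual lag statistics for lags 1..max_lag by summing them."""
    return sum(lag_rank_r2(series, k) for k in range(1, max_lag + 1))


def permutation_pvalue(series, max_lag, n_perm=200, seed=0):
    """Permutation p-value under the i.i.d. null, where shuffling is valid."""
    rng = random.Random(seed)
    observed = portmanteau(series, max_lag)
    shuffled = list(series)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if portmanteau(shuffled, max_lag) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

For a strongly trending series such as `list(range(50))`, every lag statistic equals 1, so `portmanteau(series, 3)` evaluates to 3.0 and the permutation p-value is small. Summing over lags is one simple way to avoid relying on a single lag order; the weighting actually used in the thesis is not stated in the abstract.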
Keywords/Search Tags: high-dimensional data, non-negative values, censored data, time series data, statistical methods