
Variable Selection And Feature Screening Methods For Some Different Models

Posted on: 2019-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Chu
Full Text: PDF
GTID: 1360330572456705
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the rapid development of modern technology, researchers increasingly face big data in which the dimension of the variables is large. Examples can be found in gene regulatory networks, gene expression microarrays, single nucleotide polymorphism studies, financial data, and so on. As the dimension grows, most of the variables turn out to be unrelated to the response. Massive data contain a great deal of information, but alongside the useful information there is also redundant and even erroneous information. Using all the data in a statistical analysis without any selection not only increases the computational complexity but also harms the quality of the results. Hence the significance of this research lies in retaining the effective information while eliminating useless variables from such complex data.

In ultrahigh-dimensional models, where the dimension p of the variables grows exponentially with the sample size n, most classical variable selection methods become invalid. For example, the sample covariance matrix in the classical least squares method is usually singular as the dimension grows, and the parameters of the maximum likelihood estimator are not identifiable when the number of likelihood equations is much smaller than the number of unknown parameters. For these reasons, researchers need new variable selection methods adapted to high-dimensional data, and many such methods have appeared, including penalized least squares, penalized maximum likelihood, penalized empirical likelihood, and feature screening.

This thesis reviews the classical methods of high-dimensional feature screening and variable selection and proposes new methods. We mainly focus on the following five questions: (1) the correlation among high-dimensional variables; (2) the use of predetermined information; (3) model-free procedures; (4) the curse of dimensionality in nonparametric regression; (5) outliers in the data. These considerations motivate several feature screening and variable selection methods.

The thesis consists of five chapters. Chapter 1 introduces some common variable selection and parameter estimation methods under different regression models and briefly reviews the development of outlier detection. Chapter 2 studies empirical likelihood and the conditional SIRS, proposing a new feature screening method that addresses questions (1)-(3). Chapter 3 performs variable selection under general single-index models, which combine the advantages of parametric and nonparametric models; the proposed method avoids the curse of dimensionality of nonparametric models, and it is worth mentioning that, through a model transformation, a complicated and hard-to-estimate general single-index model is turned into a simple linear model. Chapter 4 detects outliers in the data, combining the penalized method with empirical likelihood to construct a robust variable selection procedure. Chapter 5 concludes.

1. Chapter 2. We first study the empirical likelihood idea. Empirical likelihood is a nonparametric method based on a data-driven likelihood ratio function. Compared with methods such as maximum likelihood, it requires neither a restrictive model structure nor assumptions on the error distribution, and it can make full use of constraint information and predetermined information, so it can be viewed as a model-free procedure. This chapter takes advantage of these properties and applies empirical likelihood to nonparametric feature screening. To solve problem (1) we consider the conditional SIRS, which replaces the marginal correlation function of the classical SIRS and reduces the correlation between $X_k$ and the conditioning variables. We propose a marginal utility based on the empirical likelihood ratio and achieve feature screening by ranking this utility. We first show that SIRS depends strongly on the correlation among the variables, so it is natural to construct a conditional version that reduces this correlation. The resulting conditional marginal correlation function measures the correlation between $X_k$ and the response given $X_C$. The marginal empirical likelihood ratio based on this correlation function is
$$l_k(C) = 2\sum_{l=1}^{n}\log\{1+\lambda\,g_{kl}(C)\},$$
where $\lambda$ is the Lagrange multiplier and $g_{kl}(C)=E\{[X_k-E(X_k\mid\gamma_C^{T}X_C)]\,1(Y<Y_l)\}$, $k\in D$. Feature screening is performed by sorting this marginal utility, which should satisfy two requirements: if $X_k$ is unimportant, $l_k(C)$ should be small; if $X_k$ is important, $l_k(C)$ should be large. These two requirements ensure that the screening method ranks the important variables at the top when the marginal utilities are arranged in descending order. The sample version of the marginal utility is given in the thesis, and the selected subset of important variables is $\hat{M}_{\nu_n}=\{k\in D: l_k(C)\geq\nu_n\}$. To ensure that this subset contains the truly active predictors, we derive the distributional properties of $l_k(C)$ at both the population and the sample level. Theorems 2.1 and 2.2 show that the marginal empirical likelihood ratio cannot be small when the kth variable is important, so the important variables can be recovered by sorting the marginal utilities. Theorem 2.3 establishes the sure screening property, which guarantees that all important variables are retained in the selected subset, so that a more refined variable selection method can be applied afterwards. In addition, Theorem 2.5 shows that the number of variables contained in $\hat{M}_{\nu_n}$ is not too large. Our procedure inherits the advantages of empirical likelihood and SIRS: the resulting conditional marginal empirical likelihood ratio method is model-free and enjoys the sure screening property, solving questions (1)-(3) effectively. Simulation studies and a real data analysis confirm the good performance of the method.
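To make the ranking-and-thresholding scheme concrete, here is a minimal Python sketch of marginal empirical-likelihood screening. It is not the thesis's conditional statistic: the moment contribution $X_{ik}(F_n(Y_i)-1/2)$ is a simplified SIRS-flavored choice, and `el_ratio`, `el_screen`, and the keep-top-$d$ rule are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

def el_ratio(g, eps=1e-8):
    """Marginal EL ratio 2*sum(log(1+lam*g)) for the moment condition E[g]=0,
    with lam the Lagrange multiplier solving sum(g/(1+lam*g)) = 0."""
    g = np.asarray(g, dtype=float)
    if g.min() >= 0.0 or g.max() <= 0.0:
        return np.inf          # 0 outside the convex hull: moment grossly violated
    lo = -1.0 / g.max() + eps  # interval on which all 1 + lam*g_i stay positive
    hi = -1.0 / g.min() - eps
    score = lambda lam: np.sum(g / (1.0 + lam * g))  # derivative of the log EL
    lam = brentq(score, lo, hi)                      # strictly decreasing => root
    return 2.0 * np.sum(np.log1p(lam * g))

def el_screen(X, y, d):
    """Rank predictors by the marginal EL utility and keep the top d."""
    n, p = X.shape
    r = (np.argsort(np.argsort(y)) + 1) / n - 0.5    # centred ranks F_n(Y) - 1/2
    utilities = np.array([el_ratio(X[:, k] * r) for k in range(p)])
    return np.argsort(utilities)[::-1][:d], utilities

# Toy usage: only X0 and X5 are active, so they should rank near the top.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
y = 2 * X[:, 0] - 3 * X[:, 5] + rng.standard_normal(200)
keep, util = el_screen(X, y, d=20)
```

Unimportant predictors satisfy the moment condition, so their statistic behaves like a small chi-square value, while important ones yield large (possibly infinite) values and float to the top of the ranking.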
2. Chapter 3. This chapter studies the general single-index model $Y=G(X^{T}\beta,\varepsilon)$, where $G(\cdot)$ is an unknown link function. The model covers various semiparametric models, including the single-index model and heteroscedastic models, and is widely used in biomedical science and econometrics. In the general single-index model the response $Y$ depends on $X$ only through the linear combination $X^{T}\beta$; this semiparametric formulation effectively avoids the curse of dimensionality of nonparametric models and combines the flexibility of nonparametric regression with the interpretability of linear regression. The purpose of the chapter is to perform variable selection under this model. Handling the model directly is challenging because the link function $G(\cdot)$ is unknown, as is the relation between the error and the response. We therefore define a new parameter $\beta_F$, which by Lemma 3.1 is proportional to the original parameter $\beta$ under adequate conditions, so dimension reduction is achieved as long as we select the nonzero components of $\beta_F$. On the basis of the definition of $\beta_F$, we construct the transformation model $F(Y)-1/2=X^{T}\beta+\varepsilon$, where $\beta$ is a p-dimensional parameter, $F(\cdot)$ is the distribution function of $Y$, and $\varepsilon$ is the error of the new model, whose distribution is unknown. The complicated general single-index model is thus transformed into an ordinary linear model that is simple to handle. The transformation, however, loses some information about the error, and least squares is not applicable to the new model because it is very sensitive to the error distribution. A natural solution is to estimate the probability density function of $\varepsilon$ by nonparametric kernel density estimation. We propose a robust profile likelihood method for parameter estimation, which uses the kernel density estimate of the new error to construct the likelihood. For variable selection we maximize a penalized profile likelihood, performing variable shrinkage and parameter estimation simultaneously. Based on this estimate, the final subset of important variables is $M_0=\{j:\hat{\beta}_j\neq 0,\ j=1,\ldots,p\}$. Our method does not need to estimate the link function, so to some extent it is simpler and more convenient; the assumptions on the error are weaker, and the method is robust to heavy-tailed errors and infinite error variance. Theorems 3.1 and 3.2 show that the resulting estimator is consistent and asymptotically normal, and Theorem 3.3 shows that it satisfies the oracle property, meaning the estimator converges to the true parameter. As for the quality of variable selection, we also show that the selected subset $M_0$ equals the true set of important variables with high probability, so the method selects the true model correctly. Simulation studies and a real data analysis verify these properties.
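The following Python sketch shows the transformation-plus-profile-likelihood idea for small p. It is an illustration under stated assumptions, not the thesis's estimator: a Gaussian kernel with a Silverman-type bandwidth, an L1 penalty standing in for the thesis's penalty function, and a generic optimizer; `penalized_profile_fit` and all its tuning constants are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def kde_loglik(e, h):
    """Leave-one-out Gaussian kernel log-likelihood of residuals e."""
    d = (e[:, None] - e[None, :]) / h
    K = np.exp(-0.5 * d ** 2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)                          # leave-one-out
    dens = K.sum(axis=1) / ((len(e) - 1) * h)
    return np.sum(np.log(dens + 1e-300))

def penalized_profile_fit(X, y, lam=0.05):
    """Fit the transformed linear model F_n(Y) - 1/2 = X beta + eps by
    maximizing a kernel-based profile likelihood with an L1 penalty."""
    n, p = X.shape
    z = (np.argsort(np.argsort(y)) + 1) / (n + 1) - 0.5   # F_n(Y) - 1/2
    def objective(beta):
        e = z - X @ beta
        h = 1.06 * max(e.std(), 1e-8) * n ** (-0.2)       # Silverman-type bandwidth
        return -kde_loglik(e, h) + lam * np.abs(beta).sum()
    beta0 = np.linalg.lstsq(X, z, rcond=None)[0]          # crude starting value
    beta = minimize(objective, beta0, method="Powell").x  # derivative-free search
    beta[np.abs(beta) < 1e-3] = 0.0                       # hard-threshold tiny coefs
    return beta
```

Note that the link function never appears: the ranks of Y replace F(Y), and the error density is re-estimated from the residuals at every candidate beta, which is exactly what makes the fit tolerant of heavy-tailed errors.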
3. Chapter 4. This chapter focuses on problem (5). In data gathering, outliers may arise from measurement error or human factors. The key point of the chapter is to eliminate the contamination and obtain good variable selection and parameter estimation results. In the presence of outliers many classical statistical methods risk losing efficiency, so we need a new robust estimation method that identifies the outliers in the data. We adopt the mean shift linear regression model $y=X\beta+\gamma+\varepsilon$, where $y=(y_1,\ldots,y_n)^{T}$ is the response, $X=(X_1^{T},\ldots,X_n^{T})^{T}$ is the design matrix, and $\gamma$ is the mean shift parameter: when $\gamma_i=0$ the ith observation is not an outlier, and when $\gamma_i\neq 0$ it is. The purpose of the chapter is to perform variable selection, parameter estimation, and outlier detection under this model. The unknown parameters are $\beta$ and $\gamma$, whose total dimension $n+p$ is larger than the sample size n, so from this point of view the task is a high-dimensional variable selection problem. To attain our purpose we need sparsity assumptions: most $\beta_j=0$, meaning most variables are unimportant, and $\gamma$ is sparse as well, meaning that a substantial number of observations are normal even though the data are contaminated. Hence variable selection and parameter estimation can still be carried out after the outliers are removed. From the error distribution we obtain the estimating equation $\frac{1}{n}\sum_{i=1}^{n}X_i^{T}(y_i-X_i\beta_0-\gamma_{0i})=0$, from which the constrained empirical likelihood is derived. Taking the sparsity of $\beta$ and $\gamma$ into account, we adopt the penalized empirical likelihood with moment function $g(Z_i;\beta,\gamma)=X_i^{T}(y_i-X_i\beta-\gamma_i)$, where $p_2(|\gamma_i|)$ is an adaptive penalty of adaptive-lasso type. Its weights are built from the residuals of an SLTS estimate, which serves as the initial value; using these residuals as penalty weights guarantees that observations flagged as outliers by the initial fit are penalized lightly, so their $\gamma_i$ remain nonzero, while normal observations are penalized heavily, so their $\gamma_i$ shrink to zero (a sketch of this scheme is given after the chapter summaries). The results show that our method enjoys a high breakdown point and full asymptotic efficiency, and the theoretical studies confirm these properties; in particular the estimator is consistent, meaning the difference between the estimator and the true parameter is small. Simulation studies indicate good performance under different contamination proportions and outlier types, and the real data analysis shows that the selected variables are sparse.

4. Chapter 5. We conclude the thesis and suggest future studies.
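As the sketch promised in the Chapter 4 summary, here is a minimal Python version of mean shift outlier detection. It substitutes an ordinary adaptive lasso for the thesis's penalized empirical likelihood and a Huber regression for the SLTS initial estimator, so apart from the mean shift augmentation itself, every choice below is an assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso, HuberRegressor

def mean_shift_outliers(X, y, lam=0.1):
    """Regress y on the augmented design [X | I_n] with an adaptive L1
    penalty on the shift parameters gamma; nonzero gamma_i flag outliers."""
    n, p = X.shape
    init = HuberRegressor().fit(X, y)        # stand-in for the SLTS initial fit
    r = y - init.predict(X)                  # initial residuals
    w = 1.0 / (np.abs(r) + 1e-6)             # adaptive weights: large residual
                                             #   -> small penalty on gamma_i
    Z = np.hstack([X, np.eye(n) / w])        # rescale identity columns by 1/w_i
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(Z, y)
    beta = fit.coef_[:p]                     # beta is penalized too (it is sparse)
    gamma = fit.coef_[p:] / w                # undo the column rescaling
    outliers = np.flatnonzero(np.abs(gamma) > 1e-8)
    return beta, gamma, outliers
```

The column rescaling turns the weighted penalty sum of w_i|gamma_i| into a plain lasso penalty, which is why the recovered gamma must be divided by the weights afterwards; once the outliers are flagged, beta can be re-estimated on the clean observations.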
Keywords/Search Tags: feature screening, variable selection, nonparametric models, semi-parametric models, general single-index models, empirical likelihood, outlier