
Discriminant Analysis And Model Selection Methods For High Dimensional Data

Posted on: 2016-10-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y L Zhang    Full Text: PDF
GTID: 1220330461984440    Subject: Probability theory and mathematical statistics
Abstract/Summary:
Discriminant analysis is a statistical method that uses historical data with known class labels to build a discriminant model and then classify new, unlabeled samples. In recent years discriminant analysis has been widely applied in medicine, the natural sciences, sociology, economics, and management science. The key step is to construct a discriminant function from the existing data of each category using probability and statistical theory; a new sample is then assigned to a category according to the discriminant rule.

The two most popular methods are Bayes discriminant analysis and Fisher discriminant analysis. Bayes discriminant analysis is based on the probabilistic information of the data: we need the probability density function of each category and the prior probability that an observation belongs to each category. In practice, however, the density of each class is difficult to know, and the computation becomes complex when the number of discriminant variables is large. In such cases, Fisher linear discriminant analysis is often preferable to the Bayes approach.

The basic idea of Fisher discrimination is to reduce dimension by projection, compressing the feature space to one dimension so that a multidimensional problem becomes a one-dimensional one. The difficulty is that samples which are separable in the original space may become mixed and inseparable after projection. The crucial task in Fisher discrimination is therefore to find a good projection direction. In general, one can find the direction along which the projected values are concentrated within each category and well separated between categories.

When the two populations share a common covariance matrix and $p < n$, the Fisher rule uses the direction $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$. When $\Sigma$, $\mu_1$, and $\mu_2$ are unknown, the Fisher rule replaces them with the sample covariance matrix $S$ and the sample means $\bar{X}_1$ and $\bar{X}_2$. It has been shown that this rule is asymptotically optimal as $n \to \infty$.

With the rapid development of science and technology, more and more attention is paid to extracting information from big data in fields such as genomics, functional magnetic resonance imaging, risk management, signal processing, climate research, and Web search. In these problems the number of variables $p$ may be much larger than the sample size $n$, so the inverse sample covariance matrix $S^{-1}$ does not exist. In this case the error probability of the Fisher rule approaches 1/2; that is, linear discriminant analysis is no better than random guessing.

To solve this problem, we propose the Dantzig discriminant method and the Lasso discriminant method in the second and third chapters, which estimate the discriminant direction $\beta$ directly in high-dimensional data. The Dantzig discriminant method of Chapter 2 derives from the fact that the Fisher discriminant direction is the least-squares solution of a linear regression. We estimate the discriminant direction $\beta$ through a penalized criterion; specifically, the corresponding linear discriminant rule is obtained by minimizing the objective function $\|\beta\|_1 + \lambda_n \|X(Y - X^\top\beta)\|_\infty$. The misclassification rate of the Dantzig discriminant method is asymptotically optimal when $\beta$ is asymptotically sparse.
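To make the Dantzig discriminant idea concrete, here is a minimal sketch assuming a Dantzig-selector-type formulation, in which the direction minimizes $\|\beta\|_1$ subject to $\|X^\top(Y - X\beta)\|_\infty \le \lambda$ (the constrained counterpart of the penalized objective above). The function name `dantzig_direction`, the toy data, and the value of `lam` are illustrative assumptions, not the dissertation's code.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_direction(X, y, lam):
    """Dantzig-selector-style estimate of a discriminant direction (sketch).

    Solves  min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= lam
    as a linear program with beta = u - v, u, v >= 0.
    """
    n, p = X.shape
    G = X.T @ X          # p x p Gram matrix
    c = X.T @ y          # p-vector
    cost = np.ones(2 * p)                     # objective: sum(u) + sum(v)
    # constraints:  G(u - v) <= c + lam   and   -G(u - v) <= -c + lam
    A_ub = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b_ub = np.concatenate([c + lam, -c + lam])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]

# toy usage: two Gaussian classes with labels coded +/-1, as in regression-based LDA
rng = np.random.default_rng(0)
n, p = 40, 100
mu = np.zeros(p); mu[:3] = 1.0
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(mu, 1, (n // 2, p))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
beta_hat = dantzig_direction(X - X.mean(0), y, lam=2.0)
print("nonzero coordinates:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```

Because $p > n$, the constraint set is always feasible, and the $\ell_1$ objective pushes most coordinates of the estimated direction to zero.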
We confirm the optimality of the method, compared with other existing methods, in both numerical simulations and real data. Since our method does not need to estimate $\Sigma^{-1}$ or $\mu_1 - \mu_2$, sparsity of $\Sigma^{-1}$ and $\mu_1 - \mu_2$ is not required, which improves the efficiency of the discriminant.

In the third chapter we propose a Lasso estimator of the linear discriminant direction $\beta$. The method is based on the idea of the generalized single-index model of Wang Tao (2012); the estimator is defined through a Lasso-type criterion in which $F_n(Y)$, the empirical distribution function of $Y$, plays the role of the response. Compared with most existing high-dimensional discriminant methods, our method differs in the following respects:
(1) Although the discriminant direction $\beta$ is equivalent to a least-squares solution, $Y$ and $X^\top\beta$ do not necessarily satisfy the classical additive structure of the linear model, so it is not appropriate to estimate $\beta$ with variable selection methods designed for the classical linear regression model;
(2) The properties of regression estimators and of the discriminant direction differ substantially: in the classical linear regression model $\{Y_i, X_i\}_{i=1}^n$ are independent and identically distributed, whereas in discriminant analysis $\{Y_i, X_i\}_{i=1}^n$ are not independent and $\{X_i\}_{i=1}^n$ come from different populations;
(3) We choose $h(y) = F(y)$ and estimate $F(y)$ by $F_n(y)$ to keep the computation simple.
We prove the consistency of the Lasso discriminant analysis in §3.3 and report simulation results in §3.4; the method performs well compared with existing methods.

In the fourth chapter we study variable selection and coefficient estimation in the linear regression model with $p > n$ via a measurement-error model selection likelihood. The theory and methods of variable selection for high-dimensional data have developed greatly over the past few decades. The traditional approach is subset selection with information criteria such as AIC, KIC, GIC, BIC, and Cp. However, this approach can be ineffective because of the heavy computation it requires, and it is also unstable: the selected variables may change drastically when the data change only slightly. Consequently, a new class of variable selection and estimation methods, the coefficient shrinkage methods, has attracted wide attention.

The coefficient shrinkage approach is based on the penalization idea proposed by Tibshirani (1996) and has developed greatly in recent years, with methods such as SCAD, the Dantzig selector, least angle regression selection, and the elastic net. These methods are designed for linear parametric regression models and select variables through the nonzero coefficients $\beta_j$. However, when a variable is not associated with a parameter in the model, as in nonparametric models and algorithmic fitting models, variables cannot be selected through parameters.

Intuitively, the variables in a model contain measurement error. If a variable with error has no effect on the regression function, that variable is unimportant and can be discarded; therefore we can select variables through the error contained in them. The fourth chapter builds an objective function that maximizes the likelihood subject to a penalty on the total measurement error, based on the measurement-error model.
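As an illustration of the Chapter 3 construction described above, the following sketch regresses $F_n(Y)$, the empirical distribution function of the class labels, on $X$ with a Lasso penalty. The name `lasso_discriminant_direction`, the toy data, the value of `alpha`, and the simple midpoint threshold rule are assumptions made for the example, not the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_discriminant_direction(X, y, alpha=0.1):
    """Lasso-type estimate of a discriminant direction (sketch).

    The class label y is replaced by F_n(y), its empirical distribution
    function, and a Lasso fit of F_n(y) on X gives the direction estimate.
    """
    n = len(y)
    # empirical CDF transform: F_n(y_i) = (# of y_j <= y_i) / n
    Fn_y = np.array([np.mean(y <= yi) for yi in y])
    model = Lasso(alpha=alpha, fit_intercept=True)
    model.fit(X, Fn_y)
    return model.coef_

# toy usage with two classes labelled 1 and 2
rng = np.random.default_rng(1)
n, p = 60, 200
mu = np.zeros(p); mu[:5] = 0.8
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(mu, 1, (n // 2, p))])
y = np.repeat([1, 2], n // 2)
beta_hat = lasso_discriminant_direction(X, y, alpha=0.05)

# classify a new observation by projecting onto the estimated direction
new_x = rng.normal(mu, 1, p)                              # drawn from class 2
score = new_x @ beta_hat
threshold = 0.5 * ((X[:n // 2] @ beta_hat).mean() + (X[n // 2:] @ beta_hat).mean())
print("assigned class:", 2 if score > threshold else 1)
```

Because the response enters only through its empirical distribution function, no additive linear-model structure between $Y$ and $X^\top\beta$ is assumed, which is the point of difference (1) above.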
The specific estimator is obtained by maximizing this penalized likelihood, where $\sigma_\lambda^2 = \beta^\top D(1/\lambda)\beta + \sigma^2$, $D(1/\lambda) = \mathrm{diag}\{1/\lambda_1, \ldots, 1/\lambda_p\}$, $\lambda_j = 1/\sigma_{U_j}^2$, and $\tau_n$ is a tuning parameter. The estimates obtained along the $\beta(\lambda)$ path are consistent, whereas for the Lasso and related methods consistency can be established only under strong conditions. We demonstrate the good properties of the estimator through numerical simulation.

In the fifth chapter we consider model selection for nonparametric regression. Suppose $(X_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed samples with joint density $p(x, y) = f(x)g(y \mid x)$ and conditional mean function $m(x)$. The aim is to estimate the conditional regression function $m(x)$. Regression models take parametric, nonparametric, and semiparametric forms. A parametric model assumes that the form of $m(x)$ is already known, so only the unknown parameters $\beta$ need to be estimated. If the assumption holds, the precision of the estimator is high; but if the true regression function is not of the assumed parametric form, the bias can be large and the accuracy poor.

A nonparametric model makes no assumption about the form of the regression function. Statisticians have proposed many nonparametric estimation methods, such as kernel estimation, local linear estimation, and local polynomial estimation. The local linear estimator is minimax in the multidimensional setting. However, as the dimension grows, local linear estimation is no longer a good estimator; this is the "curse of dimensionality" in nonparametric estimation.

In view of the curse of dimensionality, many scholars have proposed restricted nonparametric and semiparametric models, such as the additive model and the partially linear model. These solve the curse of dimensionality to a certain extent. However, the additive model and the semiparametric model still impose assumptions on the regression model, and the accuracy can be very poor if the assumed model is wrong.

In addition, there is a kind of semiparametric model that does not assume a model form (known as the full model). Its basic idea is that prior or experiential information may suggest the form of the regression function, so the estimator neither suffers the large bias of parametric estimation nor the instability of nonparametric estimation. The key is how to incorporate this prior information into the modeling process. The nonparametric model penalizing nonadditivity proposed by Studer (2005) and the local-linear additive estimator proposed by Lin (2013) laid the foundation for this problem.

In the fifth chapter we put forward a penalized model selection method based on the fact that any function can be decomposed into a linear part and a remainder. Combining the linear part with the nonlinear part gives a continuous family of semiparametric models $r_\lambda$: when the parameter $\lambda = 0$ it becomes the full model, and when $\lambda = \infty$ it is the linear model. This estimator avoids the curse of dimensionality and is a combination of the linear model and the full model. When the true regression function is linear, the rate of convergence attains the parametric $\sqrt{n}$ rate. We give the asymptotic properties of the estimator and examine our method with numerical examples.
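To illustrate the idea of a $\lambda$-indexed family that interpolates between the full model and the linear model, here is a rough analogue: the fit is decomposed into a linear part plus a remainder, and the remainder is penalized with a kernel ridge penalty so that $\lambda \to \infty$ shrinks it to zero (pure linear model) while $\lambda \approx 0$ leaves it nearly unpenalized (close to the full model). This is only a sketch of the general principle under assumed choices (RBF kernel, the name `semiparametric_family_fit`, the toy data); it is not the estimator $r_\lambda$ studied in the thesis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge

def semiparametric_family_fit(X, y, lam):
    """Fit m(x) = linear part + remainder, penalizing the remainder by lam."""
    lin = LinearRegression().fit(X, y)                    # linear part
    resid = y - lin.predict(X)                            # what the linear part misses
    nonlin = KernelRidge(alpha=max(lam, 1e-8), kernel="rbf", gamma=1.0).fit(X, resid)
    def m_hat(x_new):
        return lin.predict(x_new) + nonlin.predict(x_new)
    return m_hat

# toy usage: true regression function is linear plus a mild nonlinear bump
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 200)
x0 = np.array([[0.5, -0.5]])
for lam in (0.01, 1.0, 100.0):
    m_hat = semiparametric_family_fit(X, y, lam)
    print(f"lambda = {lam:6.2f}:  m_hat(x0) = {m_hat(x0)[0]:.3f}")
```

As the printout suggests, small values of `lam` let the remainder absorb nonlinear structure, while large values collapse the fit toward the ordinary linear regression, mirroring the two endpoints of the family described above.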
Keywords/Search Tags: misclassification rate, discriminant direction, penalty objective function, oracle properties, measurement error likelihood function, consistency, model selection, convergence rate