High-dimensional data, whose dimension is often close to or even exceeds the sample size, are now routinely collected and stored. Compared with traditional data analysis, high-dimensional data analysis is more complicated and difficult, and using machine learning algorithms to analyze high-dimensional data has become a research hotspot in the data field. The asymptotic and non-asymptotic theory of random matrices transcends the framework of classical multivariate statistical analysis; it is well suited to studying the statistical characteristics of high-dimensional data and helps machine learning algorithms analyze such data. In this dissertation, we analyze the problems that traditional machine learning algorithms face in high-dimensional data analysis. Based on relevant results from random matrix theory, we propose a regularized discriminant analysis algorithm, a regularized discriminant analysis algorithm based on an improved mean estimate, and a dimension reduction algorithm for high-dimensional data with missing observations. The main research contents are as follows:

(1) Linear discriminant analysis (LDA) performs well in many applications, but it is not suitable for analyzing high-dimensional data. A primary reason for this inefficiency is that the sample covariance matrix is no longer a good estimator of the population covariance matrix when the data dimension is close to or even larger than the sample size. We propose a regularized discriminant analysis method based on random matrix theory. First, to obtain a good estimate of the high-dimensional covariance matrix, nonlinear shrinkage and eigenvalue clipping are applied. Then, the estimated covariance matrix is used to compute the discriminant function values on which classification is based. Experiments on simulated and real datasets show that the proposed algorithm not only has a wider range of applications, but also has
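The covariance repair step in contribution (1) can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it applies only the eigenvalue-clipping idea, using the Marchenko-Pastur upper edge as a crude noise/signal cutoff, and the helper names (`clipped_covariance`, `lda_predict`) are hypothetical.

```python
import numpy as np

def clipped_covariance(X):
    """Covariance estimate via eigenvalue clipping (a sketch).

    Eigenvalues below the Marchenko-Pastur upper edge are treated as
    sampling noise and replaced by their average; larger ones are kept.
    """
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(S)          # ascending eigenvalues
    q = p / n                           # aspect ratio
    sigma2 = lam.mean()                 # crude noise-scale assumption
    edge = sigma2 * (1 + np.sqrt(q)) ** 2   # Marchenko-Pastur upper edge
    bulk = lam < edge
    lam_clipped = lam.copy()
    if bulk.any():
        # Replace the noise bulk by its average (preserves the bulk trace).
        lam_clipped[bulk] = lam[bulk].mean()
    return V @ np.diag(lam_clipped) @ V.T

def lda_predict(X, means, Sigma, priors):
    """Classify rows of X with the LDA discriminant function
    delta_k(x) = x' P mu_k - 0.5 mu_k' P mu_k + log pi_k, P = Sigma^{-1}."""
    P = np.linalg.inv(Sigma)
    scores = np.stack(
        [X @ P @ m - 0.5 * m @ P @ m + np.log(pi)
         for m, pi in zip(means, priors)],
        axis=1,
    )
    return scores.argmax(axis=1)
```

Replacing the bulk eigenvalues by their average removes the sampling noise in the bulk while keeping the estimate well conditioned, so the inverse needed by the discriminant function remains stable even when the dimension is close to the sample size.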
a higher classification accuracy.

(2) The sample mean in the discriminant model is also affected by high dimensionality: its estimation error grows with the dimension, which increases the misclassification rate of the discriminant model. We propose a regularized discriminant analysis based on an improved mean estimate. On top of the regularized discriminant analysis algorithm, the mean is re-estimated with an optimal shrinkage estimation method. Then, the re-estimated mean replaces the sample mean in the discriminant model, which further improves the classification performance of the regularized discriminant model. Experiments on simulated and real datasets likewise show the superiority and effectiveness of the proposed algorithm.

(3) Some data may be lost during collection and storage. Most data analysis methods struggle to analyze high-dimensional datasets that contain missing values, or produce unsatisfactory results on them. We propose a principal component analysis (PCA) algorithm for dimension reduction of high-dimensional data with missing values. First, based on results from random matrix theory, a covariance matrix estimate for the high-dimensional missing data is obtained by a Lasso-type matrix estimation. Then, we decompose this estimate, select the leading eigenvectors to form a low-dimensional projection matrix, and use it to project the high-dimensional data into a low-dimensional space. Finally, linear discriminant analysis combined with the improved PCA is used to classify high-dimensional data with missing values. Classification results on simulated and real datasets show that the proposed algorithm can reduce the dimensionality of high-dimensional missing data and improve the classification accuracy of linear discriminant analysis on such data.
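The mean re-estimation step described in contribution (2) can be illustrated with a James-Stein-style shrinkage estimator. The abstract does not specify the exact "optimal shrinkage estimation method", so this is only one common instance, shrinking the sample mean toward zero (shrinking toward a grand mean is analogous); the name `shrink_mean` is hypothetical.

```python
import numpy as np

def shrink_mean(X):
    """James-Stein-style shrinkage of the sample mean toward zero (a sketch).

    For xbar ~ N(mu, sigma^2/n I) in p dimensions, the positive-part
    James-Stein estimator scales xbar by 1 - (p - 2) sigma^2 / (n ||xbar||^2).
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    sigma2 = X.var(axis=0, ddof=1).mean()   # pooled noise-level estimate
    norm2 = np.sum(xbar ** 2)
    # Larger dimension and noise -> stronger shrinkage toward zero.
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / (n * norm2))
    return factor * xbar
```

When the true mean is far from the shrinkage target, the factor is close to 1 and the estimate is nearly the sample mean; when the sample mean is mostly noise, the factor is small and the estimator suppresses it, which is what reduces the misclassification rate of the discriminant model.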
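For the missing-data PCA of contribution (3), the abstract names a Lasso estimation of the covariance matrix. As a hedged sketch under simpler assumptions, the following combines a pairwise-complete covariance over the observed entries with soft-thresholding of the off-diagonal entries (a lasso-type sparse estimate), then eigendecomposes and projects onto the leading eigenvectors. All function names here are hypothetical, and this is not claimed to be the dissertation's estimator.

```python
import numpy as np

def masked_covariance(X):
    """Pairwise-complete sample covariance for data with NaN entries."""
    mask = ~np.isnan(X)
    obs = np.where(mask, X, 0.0)
    mu = obs.sum(axis=0) / mask.sum(axis=0)        # per-column observed means
    Xd = np.where(mask, X - mu, 0.0)               # centered, zeros at NaNs
    counts = mask.T.astype(float) @ mask.astype(float)  # pairwise counts
    return (Xd.T @ Xd) / np.maximum(counts - 1, 1)

def soft_threshold(S, tau):
    """Soft-threshold off-diagonal entries (lasso-type sparsification)."""
    T = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)
    np.fill_diagonal(T, np.diag(S))                # keep variances intact
    return T

def pca_project(X, k, tau=0.1):
    """Project NaN-containing data onto the top-k eigenvectors of the
    thresholded pairwise-complete covariance estimate."""
    S = soft_threshold(masked_covariance(X), tau)
    _, V = np.linalg.eigh(S)                       # ascending order
    W = V[:, ::-1][:, :k]                          # top-k projection matrix
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), col_means, X) # mean-impute for projection
    return (X_filled - col_means) @ W
```

The low-dimensional scores returned by `pca_project` can then be fed to an ordinary LDA classifier, mirroring the pipeline described above: covariance estimation from incomplete data, projection, then discriminant analysis in the reduced space.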