In the era of big data, a vast number of objects are represented as high-dimensional data, such as high-definition images, gene chips, audio and video, and text. However, operating directly on large, high-dimensional data is inefficient and even infeasible for many classical methods. Moreover, the effectiveness and efficiency of these methods degrade rapidly as the dimensionality increases, a phenomenon commonly referred to as the "curse of dimensionality". How to extract key information quickly and effectively from large, high-dimensional data has therefore become a major research topic in data mining, information retrieval, pattern recognition, and other information processing fields. Dimensionality reduction is a common technique for feature extraction and has received broad attention. However, some classical dimensionality reduction methods cannot handle high-dimensional, large-sample data quickly and effectively. This thesis therefore develops a series of fast randomized algorithms for high-dimensional, large-sample data and verifies their efficiency in reconstruction, recognition, clustering, and other applications. The main contributions of this thesis are as follows:

a) Generalized low-rank approximations of matrices (GLRAM) is a popular technique for linear dimensionality reduction. In essence, GLRAM-type algorithms can be viewed as a generalization of the singular value decomposition (SVD); the difference is that the former provides low-rank approximations to a collection of matrices rather than a single matrix. However, such algorithms suffer from heavy computational overhead in practice, especially for high-dimensional data. To reduce this cost, we propose a randomized GLRAM algorithm based on the randomized singular value decomposition (RSVD). In addition, we provide a theoretical basis for the algorithm from three perspectives. First, we analyze the decay of the singular values of the matrices arising during GLRAM iterations and derive, from a theoretical point of view, the target rank required in the RSVD step. Second, we establish the relationship between the reconstruction errors of the randomized and standard GLRAM algorithms, showing that the two are comparable even when the solutions are computed inexactly during iterations. Third, we investigate the convergence of the randomized GLRAM algorithm.

b) The approximate class-specific kernel spectral regression (ACS-KSR) method is a powerful tool for face verification; it is a nonlinear dimensionality reduction method based on kernel functions. The method consists of two steps, an eigenanalysis step and a kernel regression step, but it may suffer from heavy computational overhead in practice, especially on large-sample data sets. Hence, in this thesis, we propose two randomized algorithms based on ACS-KSR. First, we correct the formula used in the eigenanalysis step of ACS-KSR, and we consider how to efficiently solve the ratio-trace and trace-ratio problems arising in the modified model. Second, it is well known that kernel matrices are approximately low-rank; however, to the best of our knowledge, how to determine the numerical rank of a kernel matrix by a simple, feasible strategy without forming the matrix explicitly remains an open problem with no rigorous theoretical support. To fill this gap, we focus on the commonly used Gaussian kernel and provide a practical strategy for determining the numerical rank of the kernel matrix. Third, exploiting this numerically low-rank property, we propose a modified fixed-rank Nyström method for the kernel regression step and establish a probabilistic error bound on the approximation. Fourth, although the proposed Nyström method reduces the computational cost of the original method, it still requires forming and storing the reduced kernel matrix explicitly, which is unfavorable for extremely large-sample data sets. To settle this problem, we propose a randomized block Kaczmarz method for the kernel regression problem with multiple right-hand sides and establish the convergence of this method.

c) The kernel method is a key technique for dealing with nonlinear problems. However, it depends on the choice of kernel function, and the most suitable kernel function for a particular learning task is often unknown in advance. Multiple kernel learning methods can overcome this problem effectively. However, to the best of our knowledge, most existing multiple kernel learning methods need to form and store the base kernel matrices and the ensemble kernel matrix explicitly, which may incur heavy computational overhead; for high-dimensional, large-sample problems, effective multiple kernel learning algorithms are therefore still lacking. To fill this gap, we propose fast randomized algorithms for multiple kernel dimensionality reduction and multiple kernel clustering, respectively. Specifically, for multiple kernel dimensionality reduction, we first propose a fast randomized algorithm that avoids the explicit formation and storage of the base kernel matrices and the ensemble kernel matrix. Second, since limits on computational cost and storage make it impossible to use a large number of kernel functions in practice, one must select "representative" kernel functions from a set of pre-specified kernel functions. Hence, we define "relatively sparse" kernel weight coefficients by characterizing the correlation between the base kernel matrices, and we consider how to solve for them efficiently. Third, we apply the strategy of avoiding the computation and storage of the base kernel matrices, together with the idea of "representative" kernels, to two main applications of multiple kernel learning: multiple kernel dimensionality reduction and multiple kernel clustering.
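The randomized GLRAM algorithm in contribution a) uses the RSVD as its core building block. As a minimal sketch of that building block only (not the thesis's full GLRAM iteration; the function name, oversampling parameter, and defaults are illustrative), a basic RSVD can be written as:

```python
import numpy as np

def rsvd(A, rank, oversample=10, rng=None):
    """Basic randomized SVD sketch: range-finding with a Gaussian
    test matrix, then an exact SVD of the small projected matrix."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    # Sketch the range of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for range(A Omega)
    B = Q.T @ A                               # small (rank+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Lift the left factor back to the original space and truncate.
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank, :]
```

For a matrix whose numerical rank does not exceed the requested target rank, the sketch captures the range almost surely and the approximation is exact up to rounding; for general matrices, accuracy depends on singular-value decay, which is exactly why the thesis analyzes the decaying singular values during GLRAM iterations.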
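Contribution b) proposes a modified fixed-rank Nyström method for the kernel regression step. The following is a hedged sketch of the *standard* fixed-rank Nyström construction with uniform landmark sampling for the Gaussian kernel; it is not the thesis's modified method or its probabilistic error bound, and all function names and parameters are our own:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def nystrom_fixed_rank(X, m, rank, sigma, rng=None):
    """Rank-`rank` Nyström factor L with K ~= L @ L.T, using m uniformly
    sampled landmark points; only an n x m slice of K is ever formed."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=False)
    C = gaussian_kernel(X, X[idx], sigma)      # n x m slice of the kernel matrix
    W = C[idx]                                 # m x m landmark block
    evals, evecs = np.linalg.eigh(W)           # ascending eigenpairs of W
    evals = evals[::-1][:rank]                 # keep the top `rank` eigenpairs
    evecs = evecs[:, ::-1][:, :rank]
    # K ~= L L.T with L = C V diag(lambda^{-1/2}); clamp tiny eigenvalues.
    return C @ evecs / np.sqrt(np.maximum(evals, 1e-12))
```

The point of the fixed-rank variant is that the factor `L` has only `rank` columns, so downstream regression works with an n x rank matrix instead of the full n x n kernel matrix, consistent with the numerically low-rank property the thesis establishes for the Gaussian kernel.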
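Contribution b) also introduces a randomized block Kaczmarz method for kernel regression with multiple right-hand sides. A generic randomized block Kaczmarz iteration for a consistent linear system AX = B is sketched below; this is a simplified textbook form, not the thesis's specific variant, and the block size and iteration count are illustrative:

```python
import numpy as np

def randomized_block_kaczmarz(A, B, block=20, iters=100, rng=None):
    """Solve a consistent system A X = B (multiple right-hand sides) by
    randomized block Kaczmarz: at each step, sample a block of rows and
    project the current iterate onto the solution set of that block."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    X = np.zeros((n, B.shape[1]))
    for _ in range(iters):
        idx = rng.choice(m, size=block, replace=False)
        A_blk, B_blk = A[idx], B[idx]
        # Orthogonal projection onto {X : A_blk X = B_blk} via the pseudoinverse.
        X += np.linalg.pinv(A_blk) @ (B_blk - A_blk @ X)
    return X
```

Because each step touches only `block` rows of A, the method never needs the full (reduced) kernel matrix in memory at once, which is the motivation the abstract gives for extremely large-sample data sets.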