Font Size: a A A

Study On Kernel Methods For Large-scale Data Set

Posted on:2009-06-29Degree:DoctorType:Dissertation
Country:ChinaCandidate:W Y ShiFull Text:PDF
GTID:1118360272489294Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Principal Component Analysis(PCA) is a classical technique for feature extraction and dimension reduction.It uses the dimension with the largest variances and neglects the less important components,which maps the sample vectors into the low dimensional subspace.But it does not work well when data is nonlinear.With the idea of kernel approach,Kernel Principal Component Analysis(KPCA) has been generalized to treat with the nonlinear data.Its main idea is to map the data set from the input space into high-dimensional(even infinite dimensional) feature space.Thus,the nonlinear components can be extracted using linear algorithm in the mapping feature space.The extracted the nonlinear feature can be used in many complex applications,such as image processing,etc.In practice,the mapping function is neither calculated nor kept explicitly, but realized implicitly via the kernel trick.The KPCA generally needs to store the kernel matrix of all data,which takes the space complexity ofO(m~2),where m is the number of data samples.In addition,it needs the time complexity of O(m~3) to extract the kernel principal components.But traditional kernel function is based on the inner product between sample vectors,the size of kernel matrix scales with the number of data set. When faced with the large-scale data set,it is infeasible to store and compute the kernel matrix because of the limited storage capacity.Consequently,some approaches must be adopted to account for the inconvenience.In order to solve the problem of the large-scale data set,some approaches have been proposed to compute kernel principal component.After discussing the essences and relations of these approaches,we proposed three methods,which can effectively solve the problem of large-scale data set.The work is based on the thorough study of kernel methods in pattern analysis.The main work is summarized as follows:●First,the Gram matrix is transformed into the two triangular matrices using incomplete Cholesky decomposition.Then each column of the triangular matrix is treated as the input sample for the iterative PCA algorithm.Thus,the kernel principle components can be iteratively computed without the eigen-decomposition.Its space and time complexity is O(nm) and O(nm) + O(nkp),respectively,where n,k,m,p is the rank of Gram matrix,the number of extracted principal components,samples and iteration, respectively.In large-scale data set,the rank of Gram matrix is greatly less than m and the extracted number of principal components is also less than m.The complexity is greatly reduced.●First,a matrix,called Gram-power matrix,is constructed using the original Gram matrix.It is proven by the theorem of linear algebra that the eigenvectors of newly constructed matrix are the same as the ones of the Gram matrix.Therefore, we can treat each column of the Gram matrix as the input sample for the iterative PCA algorithm.Thus,the kernel principle components can be iteratively computed without the eigen-decomposition.The space complexity of proposed method is only O(m).●We propose a new framework,Matrix-based Kernel Principal Component Analysis(M-KPCA).By dividing the large-scale data set into small subsets,we could treat the autocorrelation matrix of each subset as the special computational unit.A novel polynomial-matrix kernel function is adopted to compute the similarity between the data matrices in place of vectors.Because the number of subset greatly less than the number of data set,the acquired kernel matrix is easy to compute.The autocorrelation matrix contains statistical information of each subset,which helps to improve the performance.The effectiveness of the proposed methods is demonstrated by the experimental results on the artificial and real data set when faced with large-scale data set.
Keywords/Search Tags:large-scale dataset, KPCA, kernel method, kernel matrix, nonlinear feature, matrix decomposition, iterative algorithm, data subset
PDF Full Text Request
Related items