Font Size: a A A

The Study Of Graph-based Embedding And Dimensionality Reduction Methods

Posted on:2014-02-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:L ZhuFull Text:PDF
GTID:1228330398972828Subject:Pattern Recognition and Intelligent Systems
Abstract/Summary:PDF Full Text Request
During the past decades, the advances in data collection and storage capabilities have caused "information overload" problem in many researche areas. Researchers face larger and larger observations and simulations on a daily basis. Such datasets, in contrast with smaller, more traditional datasets that have been studied extensively in the past, present new challenges in data analysis. Luckily, the main source of the variantion in many real datasets is controlled by a small set of hidden variables. Similar situations emerge in large-scale data arising in many areas of research, such as bioinformatics, robot navigation, and natural language processing. The small number of hidden units that capture the majority of variation in the data describe a low-dimensional manifold which can be represented as a graph specifying which points on the manifold are neighbors. Properly embedding this graph can recover the hidden low-dimensional coordinates, allowing for better data exploration, visualization, and modeling of the complex data. Toward this goal, the main work of this paper can be described as follows:(1) Linear discriminant analysis (LDA) is one of the most popular approaches for supervised feature extraction and dimension reduction. However, the computation of LDA involves dense matrices eigendecomposition, which is time-consuming for large-scale problems. In this paper, we present a novel algorithm called Rayleigh-Ritz Discriminant Analysis (RRDA) for efficiently solving LDA. While much of the prior research focus on transforming the generalized eigenvalue problem into a least squares formulation, our method is instead based on the well-established Rayleigh-Ritz framework for general eigenvalue problems and seeks to directly solve the generalized eigenvalue problem of LDA. By exploiting the structures in LDA problems, we are able to design customized and highly efficient subspace expansion and extraction strategy for the Rayleigh-Ritz procedure. To reduce the storage requirement and computational complexity of RRDA for the small sample size problems, we also establish an equivalent reduced model of the RRDA. Practical implementations and the convergence result of our method are also discussed. Our experimental results on several real world data sets indicate the performance of the proposed algorithm.(2) Regularized linear discriminant analysis (RLDA) is a important subspace learning method proposed recently to handle the small sample size (SSS) problem of LDA. One important unsolved issue of RLDA is how to automatically determine an appropriate regularization parameter without resorting to unscalable procedures like cross-validation. In this paper, we develop a novel efficient algorithm to automatically estimate the regularization parameter based on a geometric interpretation of RLDA. We further provide a formal analysis of the proposed method, and show that it is robust to the perturbation in the feature space of the training data. The extensive experiments on various benchmark datasets verify the scalability and effectiveness of our approach, compared with the state-of-the-art algorithms.(3) Protein-protein interaction (PPI) networks provide insights into understanding of biological processes, function and the underlying complex evolutionary mechanisms of the cell. Modeling PPI network is an important and fundamental problem in system biology, where it is still of major concern to find a better fitting model that requires less structural assumptions and is more robust against the large fraction of noisy PPIs. In this paper, we propose a new graph embedding approach called t-logistic semantic embedding (t-LSE) to model PPI networks. t-LSE tries to adaptively learn a metric embedding under the simple geometric assumption of PPI networks, and a non-convex cost function was adopted to deal with the noise in PPI networks.The experimental results show the superiority of the fit of t-LSE over other network models to PPI data. Furthermore, the robust loss function adopted here leads to big improvements for dealing with the noise in PPI network. The proposed model could thus facilitate further graph-based studies of PPIs and may help infer the hidden underlying biological knowledge.
Keywords/Search Tags:Graph embedding, Graph-based dimensionality reduction, Lineardiscriminant analysis, Protein-protein interaction network
PDF Full Text Request
Related items