Font Size: a A A

New Geometric Approaches for Several Machine Learning Problem

Posted on:2019-09-30Degree:Ph.DType:Dissertation
University:State University of New York at BuffaloCandidate:Liu, YangweiFull Text:PDF
GTID:1478390017986652Subject:Computer Science
Abstract/Summary:
Computational geometry has found applications in many different real world problems--both theoretical and applicational ones. In this dissertation, we apply geometric techniques to multiple problems, including traditional machine learning problems, deep learning related problems, and a biological problem that can be modeled as a graph matching problem.;1. Distributed SVM and Online SVM. Support Vector Machine (SVM) is an effective classification technique widely used in various machine learning applications. While having a theoretically guaranteed good classification accuracy, training an SVM can be challenging when the size of the training set is large, which is often the case in the big data era. For instance, a huge amount of data might be obtained in distributed sites and communication between different sites could be expensive. Thus a distributed SVM algorithm with low communication cost is desired. It is also possible that the size of the training set might be too large to fit into RAM. Thus an online SVM algorithm with low space complexity is expected. a. For distributed SVM, we aim to design communication-efficient algorithms. In this work, we first provide a lower bound on the communication complexity of distributed SVM in the coordinator model, based on a reduction from the k-OR problem. Then, we present an algorithm that reaches this bound, and has the desired optimality of both convergence speed and quality of solution. Finally, we give a robust version of the algorithm which can explicitly avoid the influence of outliers in the training set with an improved communication cost, based on a faster distributed top-t selection algorithm. b. For online SVM, we aim to find a space-efficient algorithm. In this work, we give an online SVM algorithm that requires only constant space with respect to the size of the training set.;Both developed algorithms are theoretically guaranteed to output a (1-epsilon)-approximation of the corresponding problem, and converge in optimal number of rounds. 2. Clustering in the Presence of Large Amount of Sparse Noise. We consider the problem of clustering with a heavy amount of sparse background noise (over 80% of the data). We propose a practical density based clustering method, combining with the strength of Minimum Spanning Tree based clustering, Locality Sensitive Hashing and Deep Neural Networks. Our algorithm first uses MST based clustering for high confidence label initialization, and then learns a mapping from the original data space to a relatively low dimensional latent space as a deep neural network, based on the labels obtained from the MST clustering. In the latent space, data are clustered based on local density, following the assumption that a point is more likely to belong to a cluster if its distance to the nearest neighbor in that cluster is small. We use locality sensitive hashing as the nearest neighbor search engine to handle the possibly high dimensional data. Experiments on both synthetic and real world datasets (with added noise) suggest that our method is capable of handling heavy noise and significantly outperforms popular existing methods. 3. Alignment of Protein-Protein-Interaction (PPI) Network. In biology, two proteins within a cell are said to be interacting if they are close to each other geometrically. Thus the PPI network is an undirected node-edge graph with some extra biological information stored at each node. Studying the alignment of two PPI networks are promising in discovering new functionality of different proteins. Since the alignment of PPI network can be treated as a generalized graph isomorphism problem which is NP-hard and notoriously challenging, current research has focused on finding algorithms with good practical performance. We developed a geometry based alignment algorithm for the global alignment of two PPI networks, which consists of a geometric step of embedding the network into a low dimensional Euclidean space and using geometric matching methods to find a matching, and a min-cost-max-flow step to handle the sequence similarity information stored in each node. Unlike other popular alignment algorithms which are either greedy or incremental, our algorithm globally optimizes the problem to yield an alignment with better quality.
Keywords/Search Tags:Problem, Machine learning, Algorithm, SVM, Alignment, Geometric, PPI, Training set
Related items