
Distributed Semi-Supervised Learning

Posted on: 2021-04-01    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z Xu    Full Text: PDF
GTID: 1368330614467713    Subject: Electronic Science and Technology
Abstract/Summary:
In the era of big data, advances in computer technology and hardware provide ever more ways to collect and store data, so we often face situations in which large amounts of data are collected and stored across multiple geographically distributed nodes. Distributed processing has been developed for this scenario: each node in the network performs local computation on its own data and exchanges limited information with its neighbors. In this way, every node can obtain global information without any raw data being transmitted among nodes, so global processing is achieved in a fully decentralized manner. Many distributed machine learning algorithms have recently been proposed, but the existing ones are supervised and require sufficient labeled data to guarantee learning performance. In many real-world applications, collecting large amounts of high-quality labeled data is difficult or expensive, so the collected data are usually unlabeled or only weakly labeled; in addition, physical and human factors may leave some features missing. It is therefore desirable to study distributed semi-supervised learning, which exploits unlabeled data in depth so that learning performance can be significantly improved. We focus on four common sub-problems in distributed semi-supervised learning: streaming data, multi-label data, missing data and partially labeled data. We overcome the difficulties of decentralized implementation and propose corresponding algorithms. The main work and contributions of this dissertation are summarized as follows.

First, we consider the classification of streaming data and develop two distributed online semi-supervised support vector machines, for horizontally and vertically partitioned data respectively. Based on a set of anchor data, we propose a new form of manifold regularization that exploits the information in both labeled and unlabeled data, and we derive a fully decentralized implementation of the global objective function. In addition, we use a sparse random feature map to approximate the kernel feature map, so the model parameter can be expressed explicitly as a finite-dimensional vector; this avoids transmitting raw data among nodes and protects data privacy. We analyze the convergence and complexity of the proposed algorithms, and simulations on several datasets demonstrate their effectiveness.
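As a rough illustration of how an explicit random feature map can stand in for a kernel feature map (the function name, the RBF kernel choice and the optional sparsification below are our own assumptions, not the construction used in the dissertation), the following sketch shows why the resulting model reduces to a finite-dimensional weight vector that nodes can exchange instead of raw samples.

```python
import numpy as np

def random_fourier_features(X, n_features=200, gamma=1.0, sparsity=0.0, seed=0):
    """Map X (n_samples, d) to an explicit feature space whose inner products
    approximate the RBF kernel exp(-gamma * ||x - y||^2).
    `sparsity` optionally zeroes a fraction of the projection weights,
    mimicking a sparse random feature map (illustrative assumption only)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))
    if sparsity > 0:
        W *= rng.random((d, n_features)) >= sparsity  # randomly sparsify weights
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# With Z = random_fourier_features(X), Z @ Z.T approximates the kernel matrix,
# so a kernel SVM becomes a linear model with a finite-dimensional weight
# vector that nodes can share without exchanging their original data.
X = np.random.randn(5, 3)
Z = random_fourier_features(X, n_features=500, gamma=0.5)
K_approx = Z @ Z.T
```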
Second, we consider distributed multi-label classification over a network and propose two distributed semi-supervised multi-label learning algorithms, based on linear and non-linear discriminant functions respectively. Drawing on information-theoretic measures, we define a cost function and develop a label correlation term that exploits the relationship between pairs of labels. Since misclassifying different labels incurs different losses, the cost function is designed to be cost-sensitive. We also employ distributed matrix completion to estimate the global label correlation term in a distributed fashion, and we obtain a decentralized implementation of the global optimization problem. We theoretically analyze the convergence and complexity of both algorithms and compare their learning performance with that of existing algorithms. Simulations on several real datasets show that our algorithms outperform existing multi-label classification algorithms in most cases.

Third, we consider the classification of incomplete data and develop a distributed semi-supervised missing-data classification algorithm based on subspace learning. Using subspace learning, we formulate a joint missing-data/nonlinear-classifier framework and apply a novel regularization over the whole dataset that maximizes the inter-class distance between different classes and minimizes the intra-class distance within the same class. Theoretical performance analysis and experiments on several datasets validate that the performance of the proposed algorithm approaches that of the corresponding centralized algorithm and is significantly better than that of existing missing-data classification algorithms.

Finally, we consider ambiguous label information and propose a distributed semi-supervised partial label learning algorithm based on average-based disambiguation. We develop a disambiguation strategy that selects the correct label from each candidate label set: at initialization, all possible labels of the unlabeled data are treated equally as candidate labels, and then the importance weights over the training data and the ground-truth confidences of the candidate labels are adaptively estimated. After sufficient iterations, the candidate label with the highest conditional probability is taken as the correct label. We present convergence and complexity analyses, and extensive experiments show that the proposed algorithm significantly outperforms existing partial label learning algorithms.
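For intuition, here is a minimal sketch of a generic averaging-based disambiguation loop for partial-label data; the function, its similarity-matrix input and its update rule are simplified assumptions for illustration and do not reproduce the distributed algorithm proposed in the dissertation.

```python
import numpy as np

def average_disambiguation(S, candidate_mask, n_iter=50):
    """Iteratively estimate label confidences for partial-label data.
    S: (n, n) row-stochastic similarity matrix over training instances.
    candidate_mask: (n, L) binary matrix, 1 where a label belongs to the
    candidate set of an instance (each row assumed to have at least one 1).
    Returns a confidence matrix F; the argmax of each row is the
    disambiguated label."""
    # start with uniform confidence over each instance's candidate set
    F = candidate_mask / candidate_mask.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        F = S @ F                                    # average neighbors' confidences
        F *= candidate_mask                          # keep mass on candidate labels only
        F /= F.sum(axis=1, keepdims=True) + 1e-12    # renormalize per instance
    return F

# Usage: labels = average_disambiguation(S, mask).argmax(axis=1)
```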
Keywords/Search Tags:Distributed learning, semi-supervised classification, random feature map, multi-label learning, missing data classification, partial label learning