Large-Scale Positive And Unlabeled Learning

Posted on: 2018-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: P Gao
Full Text: PDF
GTID: 2348330512498176
Subject: Computer Science and Technology
Abstract/Summary:
Machine learning from positive and unlabeled samples is called PU learning, and it is widely used in practice. For example, an enterprise may look for new customers who resemble its existing customers, which are treated as positive data; this task is also called "Lookalike".

PU learning can be divided into two categories according to the application scenario: PU classification and PU matrix completion. PU classification builds a model for a specific task, such as Lookalike for a single product. PU matrix completion models the relationship between two sets of entities, as in one-class collaborative filtering and multi-label learning. In many cases, auxiliary feature information is available in addition to the relationship matrix between entities, such as user or product features in one-class collaborative filtering; the PU inductive matrix completion algorithm can then be used to obtain better results.

Existing PU learning algorithms are all implemented on a stand-alone machine. In the big data era, however, practical machine learning algorithms should be able to run in a distributed setting. This thesis designs and implements distributed versions of existing PU learning methods on Spark. In addition, inspired by multi-task learning, we propose a new model called cluster PU inductive matrix completion. This thesis makes the following three contributions:

1. We implement distributed PU classification algorithms, including distributed two-step methods and distributed cost-sensitive methods. We compare all of these methods on a large Lookalike data set. Moreover, these algorithms have a certain degree of scalability.
2. We implement the distributed PU inductive matrix completion algorithm and conduct experiments on benchmark data sets for recommender systems and multi-label learning. We find that the algorithm's scalability is very competitive.
3. We propose a new method called cluster PU inductive matrix completion and design a distributed learning method for it. On benchmark data sets for recommender systems and multi-label learning, we compare our method with the state-of-the-art PU inductive matrix completion method. We find that our method achieves a better AUC together with competitive scalability.
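
To make the idea behind the cost-sensitive PU classification methods in the first contribution concrete, the following is a minimal single-machine sketch, not the Spark implementation described in the thesis. It assumes one common cost-sensitive formulation: unlabeled samples are treated as noisy negatives, and errors on labeled positives are penalized more heavily in a weighted logistic regression. The function names, the weights `pos_weight` and `unl_weight`, and the toy data are all illustrative assumptions and do not come from the thesis.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Model output for x: higher means more similar to the labeled positives."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_cost_sensitive_pu(X, s, pos_weight=5.0, unl_weight=1.0,
                            lr=0.1, epochs=200):
    """Weighted logistic regression on PU data (illustrative sketch).

    s[i] == 1 marks a labeled positive; s[i] == 0 marks an unlabeled sample,
    which is treated as a negative with a smaller misclassification cost.
    """
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        # The gradient is a plain sum over samples, so each partition of the
        # data could compute its partial sum independently before a reduce.
        for x, label in zip(X, s):
            cost = pos_weight if label == 1 else unl_weight
            err = cost * (score(w, b, x) - label)
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_toy_pu_data(n_per_class=100, labeled_frac=0.3, seed=0):
    """Synthetic data: only a fraction of the true positives carry a label."""
    rng = random.Random(seed)
    X, s = [], []
    for _ in range(n_per_class):  # true positives, centered at (2, 2)
        X.append([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)])
        s.append(1 if rng.random() < labeled_frac else 0)
    for _ in range(n_per_class):  # true negatives, always unlabeled
        X.append([rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)])
        s.append(0)
    return X, s


if __name__ == "__main__":
    X, s = make_toy_pu_data()
    w, b = train_cost_sensitive_pu(X, s)
    print("score near positives:", score(w, b, [2.0, 2.0]))
    print("score near negatives:", score(w, b, [-2.0, -2.0]))
```

Because the weighted gradient is a sum of per-sample terms, this kind of model is a natural candidate for data-parallel training, which is one plausible reason such methods are suited to a distributed setting; how the thesis actually partitions and aggregates the computation on Spark is not specified in the abstract.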
Keywords/Search Tags: PU Learning, Classification Algorithm, Matrix Completion, Cluster Algorithm, Spark