Research And Implementation Of Distributed Machine Learning Platform Based On Spark And Pu-learning

Posted on: 2020-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Jiang
GTID: 2428330575957076
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the development of Internet technology, the volume of enterprise data has grown year by year, and this data is regarded as a core enterprise resource and asset. Machine learning will continue to play a central role in discovering the potential value of that data. In industry, large enterprises typically build distributed machine learning platforms to provide more efficient machine learning services. However, current distributed machine learning platforms still have the following problems:

1) Commercial distributed machine learning platforms that serve external users are built in specific cluster environments inside the provider and are difficult to deploy elsewhere. National organizations and enterprises with strict data security requirements, such as the National Health Commission, are therefore reluctant to place data on such commercial platforms for analysis, and need a machine learning platform that can be deployed in their own internal environment.

2) At the algorithm level, the National Health Commission holds a large amount of unlabeled data. In pre-pregnancy eugenics data, a follow-up is needed to determine whether a newborn has a birth defect, and a large number of people have not been followed up. Among them are silent, potentially positive cases. This is the problem addressed by PU learning (Positive and Unlabeled Learning, abbreviated PU-learning), a kind of semi-supervised learning algorithm that learns from positive and unlabeled samples. However, PU learning is currently implemented only in stand-alone versions, which prevents the platform from analyzing and processing such data.

In response to these problems and needs, this thesis conducts in-depth research on distributed machine learning platforms and related technologies. The main contributions of this thesis are as follows:

1) Based on a study of the state of research on PU learning algorithms and the Spark framework, we design and implement a Spark-based PU learning algorithm (puSpark) that can process large-scale data efficiently.

2) We research and implement a PU learning extension framework (igBBPu) to further improve the accuracy of the algorithm. It integrates Bagging and Boosting techniques and uses a weight update strategy based on mutual information to optimize the algorithm.

3) We research and implement a distributed machine learning platform that provides a complete one-stop algorithm construction service. Because the platform is built on the open source distributed computing framework Spark, it can be deployed on the internal cluster of the National Health Commission, which helps protect data security. It consists of three main modules: a resource management module, an algorithm implementation module, and a log audit module. It ships with the algorithms proposed and implemented in this thesis as well as other basic machine learning algorithms. On the front end, users can drag and drop to assemble different machine learning task workflows, build models with machine learning algorithms, and query and analyze the results.
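To make the PU learning setting more concrete, the following is a minimal single-machine sketch of bagging-style PU learning. It is not the thesis's puSpark or igBBPu, whose details the abstract does not give. The function names, the nearest-centroid base classifier, and the toy data are all assumptions chosen for illustration. In each round, a random sample of unlabeled points is treated as pseudo-negatives, a simple classifier is fit against the known positives, and the unlabeled points left out of the sample are scored.

```python
import random

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pu_bagging_scores(positives, unlabeled, rounds=50, seed=0):
    """Hypothetical helper: for each unlabeled point, return the fraction of
    rounds (among those where it was out-of-bag) in which it lay closer to
    the positive centroid than to the pseudo-negative centroid."""
    rng = random.Random(seed)
    pos_center = centroid(positives)
    votes = [0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    k = min(len(positives), len(unlabeled))
    for _ in range(rounds):
        # Bootstrap sample of unlabeled points, treated as negatives this round.
        bag = [rng.randrange(len(unlabeled)) for _ in range(k)]
        in_bag = set(bag)
        neg_center = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # only score points not used for training this round
            counts[i] += 1
            if sq_dist(x, pos_center) < sq_dist(x, neg_center):
                votes[i] += 1
    # None marks a point that was never out-of-bag.
    return [v / c if c else None for v, c in zip(votes, counts)]

# Toy data (made up): labeled positives near (5, 5); the unlabeled set holds
# eight points near the origin and two hidden positives at the end.
positives = [(5.0, 5.0), (5.2, 4.8), (4.8, 5.1), (5.1, 5.2)]
unlabeled = [(0.0, 0.0), (0.3, -0.2), (-0.1, 0.4), (0.2, 0.1),
             (-0.3, -0.1), (0.1, -0.4), (0.4, 0.2), (-0.2, 0.3),
             (5.1, 4.9), (4.9, 5.0)]
scores = pu_bagging_scores(positives, unlabeled)
```

In this toy setup the two hidden positives receive high scores and the points near the origin receive low ones. A distributed version would have to decide how to partition the rounds or the data across workers, which this sketch does not attempt.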
Keywords/Search Tags:machine learning platform, Spark, PU-learning