Research On Distributed Streaming Analytics And Real-Time Machine Learning

Posted on: 2021-01-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y C Cheng
GTID: 1368330602993445
Subject: Computing application engineering
Abstract/Summary:
A new era is emerging, and artificial intelligence systems play an increasingly important role in it. The exponential growth of hardware performance and the development of complex machine learning algorithms will further promote the development of intelligent systems. These systems will support autopilot technologies that reduce traffic congestion and traffic accidents; they will help users escape information cocoons through real-time analysis of user interest drift, and deliver the important information users need quickly and intelligently; they will make the information systems of enterprises and institutions safer through real-time detection of and defense against financial fraud and Internet attacks; and they will revolutionize medical and health care by identifying high-risk patients as early as possible, using nanoprobes for cell-level diagnosis and treatment, and enabling robotic surgery. These emerging AI systems are changing the way people perceive and interact with the world, making everything more humanized. To realize this vision, a new generation of AI systems is needed to support the protection of human safety and well-being. In addition to intelligence, these decision-making systems need to overcome at least three challenges. First, they must react in real time (i.e., make decisions in seconds or even milliseconds) to support autopilot, intelligent information retrieval, and other emerging AI applications. Second, these systems need to learn constantly from real-time data streams, because their environment changes gradually. Finally, when these systems make decisions on behalf of humans, their decisions need to be interpretable. These challenges involve data, models, algorithms, hardware, and many other aspects. Thus, this research studied distributed streaming analytics and real-time machine learning, which involves software architecture, efficient algorithms, and important theoretical limitations on modeling methods. When learning from a data stream, as new data arrives, the previous data
is no longer available for modifying the earlier established sub-optimal models. To overcome all of these challenges, this research proposed a strategy summarized as "one core problem, two concerns, and three aspects".

"One core problem" is the online optimization problem. Data streams arrive online, which means that the entire data set is not immediately available and individual data instances arrive in order. Many traditional optimization methods cannot learn continuously from streaming data. This research explores approximate algorithms, adaptive learning, and second-order optimization methods to solve online optimization under resource constraints, computational constraints, and complex inequality constraints. Adaptive learning in this research refers to updating prediction models online and adapting to concept drift. The steps of online adaptive learning can be summarized as: 1) make assumptions about the distribution of the upcoming data stream; 2) identify change patterns; 3) design a mechanism to adapt the model; 4) parameterize the model at each step.

"Two concerns" refers to the real-time "model" concern and the real-time "feature" concern. To achieve real-time machine learning, the machine learning features need to be fed to the system in real time, so that the model can always be trained on the latest features. This research uses optimized stream processing engines to perform real-time feature engineering and immediately feeds the features to the model, which ensures that the learning process is affected by the data stream in real time. The real-time "model" concern is mainly addressed by improving the sparsity of the model, which is very important. For a recommender, real-time features may benefit only some particular users, whereas a real-time model can quickly capture global data changes and newly generated data patterns in the system.

"Three aspects" refers to data, computing power, and algorithms. (1) As for data, this research focuses on streaming big data with high
velocity, high-dimensional and high-volume features, and tries to extract a slice from the streaming big data to gain intuition (sense perception). In contrast to batch processing in the cloud, real-time machine learning is inherently based on streaming analytics. (2) As for computing power, this research takes the "Tianhe-2 supercomputer" as the computing resource pool for real-time machine learning from streaming data. The challenges are resource elasticity and resource-aware hyper-parameter estimation. (3) As for algorithms, this research focuses on online algorithms that enable real-time learning and reasoning.

Following the strategy of "one core problem, two concerns, and three aspects", this research studied real-time machine learning and made the following contributions:

1. This research analyzed the history and future of distributed streaming analytics and real-time machine learning problems, and proposed a corresponding research strategy, "one core problem, two concerns, and three aspects", to overcome the challenges brought by emerging AI applications such as autopilot and intelligent information retrieval.

2. This research proposed an optimal solution for dynamic resource scaling for distributed streaming analytics on the "Tianhe-2 supercomputer". The mathematical essence of resource elasticity in stream processing is dynamic portfolio optimization. Multiple streaming analytics jobs run in parallel in a real-time machine learning system, training multiple models. Different streaming analytics jobs have different logical operator topologies and deal with different features, so their computing resource requirements vary. To deal with this complex situation, this research proposed an optimal algorithm framework, HPC2-ARS. First, this research investigated static and elastic resource scheduling methods, and then modeled dynamic resource scaling for distributed streaming analytics on the "Tianhe-2 supercomputer". Based on this mathematical model, a stream processing delay estimation model was built. The
challenge is a complex multi-objective optimization problem with user-defined stream processing delay constraints and system resource constraints. To solve it, this research designed a polynomial-time algorithm framework, HPC2-ARS, which includes a utility function design mechanism and a new scalarization method that turns the complex multi-objective optimization problem into a single-objective optimization problem, and then proposed a hybrid heuristic optimal resource allocation algorithm based on the principle of maximizing marginal utility. The algorithm allocates resources to stream processing tasks according to three principles (Round Robin, Delay, and Utility) in three different situations: survival (Round Robin), fairness first (Delay), and efficiency first (Utility). These principles arbitrate the competing resource requirements of parallel streaming analytics.

3. This research studied the interaction between resource consumption and model performance. First, this work focused on the challenge of optimal resource allocation across heterogeneous data sources in real-time machine learning systems, and proposed an optimal solution based on convex optimization theory. Second, by modeling the randomness and resource availability of real-time machine learning systems, a dynamic resource scaling algorithm (HPC2-ARS-D) based on approximate dynamic programming and a Markov utility model was proposed, which overcomes the curse of dimensionality, accurately characterizes the time-varying nature of the real-time machine learning system, and efficiently reflects the impact of the system's resource consumption on the performance of the ML model.

4. An online deep Bayesian recommender is proposed, which can perform all stages of training (including incremental training, hyper-parameter estimation, and deployment) in real time using real-time streaming analytics and a variational gated recurrent unit (GRU) network, and which clarifies the asymmetry between real-time learning and real-time inference. Specifically, this
research first investigated various deep learning-based recommenders and online recommenders. It then mathematically modeled the user-item interactions with time information in the data streams, and analyzed and modeled the concept drift in the streaming recommender. Based on these mathematical models, a streaming deep Bayesian recommender is proposed, using the reliable mathematical tools provided by Bayesian methods to deal with the randomness and uncertainty of the system. To balance the sparsity, accuracy, and interpretability of the streaming deep Bayesian recommender, this research uses mean-field approximation theory and the variational GRU to approximate the posterior probability distribution of user-item interactions in real time. The variational GRU uses online variational inference of discrete events in continuous time to establish the relationship between the Bayesian process and the deep factorization model in the streaming data. Finally, a second-order method based on K-FAC is used to optimize the Evidence Lower Bound (ELBO) of the streaming deep Bayesian recommender. Experiments on multiple benchmarks showed that the proposed streaming deep Bayesian recommender captures concept drift over time better than multiple baselines.
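
The principle of maximizing marginal utility mentioned under contribution 2 can be illustrated with a small Python sketch. This is not the HPC2-ARS algorithm itself, whose utility design and delay model are not given in this abstract; it is a generic greedy allocator under stated assumptions. The function name `allocate_by_marginal_utility` and the three concave utility functions are hypothetical, chosen only to show diminishing returns.

```python
import heapq
import math


def allocate_by_marginal_utility(utilities, total_units):
    """Greedily hand out resource units, one at a time, to the job whose
    utility would increase the most from one extra unit.

    utilities: list of functions u(n) -> float, one per job (hypothetical).
    total_units: number of identical resource units to distribute.
    Returns a list with the number of units given to each job.
    """
    allocation = [0] * len(utilities)
    # Max-heap keyed on marginal gain (negated, since heapq is a min-heap).
    heap = [(-(u(1) - u(0)), job) for job, u in enumerate(utilities)]
    heapq.heapify(heap)
    for _ in range(total_units):
        if not heap:
            break
        neg_gain, job = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # the best remaining gain is non-positive, so stop
        allocation[job] += 1
        n = allocation[job]
        u = utilities[job]
        # Re-insert the job with the gain of its next unit.
        heapq.heappush(heap, (-(u(n + 1) - u(n)), job))
    return allocation


# Hypothetical concave utilities: each extra unit helps less than the last.
jobs = [
    lambda n: 3.0 * math.log1p(n),
    lambda n: 1.0 * math.log1p(n),
    lambda n: 2.0 * math.sqrt(n),
]
allocation = allocate_by_marginal_utility(jobs, total_units=10)
print(allocation)
```

For concave utilities such as these, taking the largest marginal gain at each step maximizes the total utility over integer allocations. The delay constraints and the Round Robin and Delay modes described in the abstract are deliberately left out of this sketch.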
Keywords/Search Tags: Distributed machine learning, Multi-objective and online optimization, Elastic resource scaling, Concept drift, Streaming deep Bayesian recommender