
Research of Factorized Learning Algorithms Based on Random Forest and Stochastic Gradient Descent

Posted on: 2022-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
Full Text: PDF
GTID: 2518306572951059
Subject: Cyberspace security
Abstract/Summary:
Machine learning applications are becoming increasingly important in enterprises. Most machine learning algorithms require a single table as input, while relational data in enterprise applications are typically normalized to satisfy the third normal form. Machine learning therefore faces a data-representation mismatch, and practitioners denormalize the data through primary/foreign-key joins, materializing the tables into a temporary single table that serves as the algorithm's input; this strategy is called "learning after join". It causes several problems: the wide table reintroduces the redundancy that normalization had removed, increasing storage requirements, and the training process incurs unnecessary computation.

To address this, "factorized learning" is attracting growing attention among researchers. Factorized learning restructures and improves classical machine learning models to take normalized multi-table datasets as input, avoiding the redundancy without sacrificing the quality of the original model. However, most studies of factorized learning are based on linear models, and only a few address nonlinear models. Moreover, most existing algorithms depend heavily on matrix linear-algebra operations, which makes them difficult to apply to powerful machine learning algorithms that involve randomness, such as random forests and stochastic gradient descent. In addition, steps that are trivial in traditional machine learning, such as splitting the data into training and test sets, become challenging in factorized learning, because a multi-table dataset cannot be easily partitioned. Previous studies simply sidestepped these problems by using datasets with a fixed join schema, which greatly limits the generality and scalability of the models.

This study focuses on factorized random forests and factorized stochastic gradient descent over normalized multi-table datasets. For random forests, we introduce a sample-over-join algorithm and the incremental decision tree EFDT to construct a factorized random forest that takes a multi-table dataset as input. For stochastic gradient descent, we design a factorized index algorithm that extracts a single tuple at any specified position of the normalized multi-table dataset, and use it to construct factorized stochastic gradient descent with a multi-table dataset as input. Extensive experiments on a synthetic dataset and multiple real-world datasets show the improvement of the two factorized algorithms over their traditional counterparts. The study demonstrates that, on large-scale datasets, the factorized learning algorithms avoid the space and time redundancy caused by the join operation without any loss of accuracy or scalability.
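The "learning after join" problem described above can be illustrated with a minimal sketch. All table and column names here are hypothetical, chosen only to show how materializing a key/foreign-key join duplicates the dimension-table attributes that normalization had stored once:

```python
# Hypothetical normalized tables: a fact table holding a foreign key into
# a dimension table (illustrative data, not from the thesis).
fact = [  # (order_id, customer_fk, amount)
    (1, "c1", 10.0), (2, "c1", 20.0), (3, "c2", 5.0), (4, "c1", 7.5),
]
dim = {  # customer_fk -> (age, city_code) feature vector, stored once
    "c1": (34, 0), "c2": (28, 1),
}

# "Learning after join": materialize one wide table via the foreign key,
# so it can be fed to a single-table learning algorithm.
wide = [(oid, amount, *dim[fk]) for oid, fk, amount in fact]

# The features of customer "c1" now appear three times in the wide table:
# the redundancy removed by normalization reappears after the join.
copies_of_c1 = sum(1 for _, fk, _ in fact if fk == "c1")
print(len(wide), copies_of_c1)  # 4 joined rows, 3 copies of c1's features
```

On realistic schemas, where a dimension table carries many attributes and each key is referenced by thousands of fact rows, this duplication dominates the size of the materialized table.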
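The idea behind a factorized index can also be sketched. The following is an assumed, simplified design (not the thesis's exact algorithm), covering only the easy key/foreign-key case where each fact row yields exactly one joined row, so a position in the virtual join maps directly to a fact row plus a dictionary lookup:

```python
import random

# Same hypothetical normalized tables as above.
fact = [(1, "c1", 10.0), (2, "c1", 20.0), (3, "c2", 5.0), (4, "c1", 7.5)]
dim = {"c1": (34, 0), "c2": (28, 1)}

def joined_tuple(i):
    """Return row i of the virtual join of fact with dim.

    For a key/foreign-key join, position i in the join is just fact[i]
    composed with one dictionary lookup: O(1) work per access and no
    materialized wide table."""
    oid, fk, amount = fact[i]
    return (oid, amount, *dim[fk])

# Stochastic gradient descent can now draw uniform random positions from
# the virtual join without ever building it.
random.seed(0)
print(joined_tuple(random.randrange(len(fact))))
```

Because any position is accessible in constant time, shuffling, mini-batching, and train/test partitioning can all operate on position indices rather than on a denormalized table.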
Keywords/Search Tags: Machine learning, Factorized learning, Random forest, Stochastic gradient descent