
Research on Data Integration Based on Random Forest

Posted on: 2024-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: P C Xiang
Full Text: PDF
GTID: 2568307088951019
Subject: Statistics
Abstract/Summary:
Facing either enormous datasets that exceed the training capacity of a single machine, or data stored across multiple clients that cannot be aggregated due to privacy protection, researchers have developed methods for multi-client joint training under distributed conditions, and federated learning is one of the major tools. The current mainstream federated frameworks based on neural networks and decision trees rely on multiple rounds of exchanging parameters such as gradients among sites to achieve strong performance. However, the complex learning process, high learning cost, and opacity of the trained model limit the application of traditional federated learning methods in many distributed scenarios, and distributed learning research still faces many difficulties. For realistic distributed environments with a small number of sites, restricted parameter transfer between sites, and inconsistent observed features across sites, current distributed learning methods appear weak and lack simple, efficient remedies. For this reason, we aim to establish a simpler data integration framework based on random forests that effectively solves the distributed learning problem in these scenarios.

In this paper, we develop a one-shot federated transfer learning method using random forests (FTRF) to improve prediction accuracy at a target data site by leveraging information from auxiliary sites. Unlike the traditional federated strategy that requires multiple rounds of transmitting parameters such as gradients, FTRF adopts a model averaging idea that needs only a single round of communication between the target and the auxiliary sites, ensuring communication efficiency. Moreover, thanks to the variable-importance measures of random forests, the information transfer performed by FTRF can be interpreted through a similarity path between the target data and the auxiliary sites. Only fitted models from the auxiliary sites are sent to the target site. Different from traditional model averaging, we combine the predicted outcomes from other sites with the original variables when estimating the model averaging weights. By incorporating these augmented variables into a random forest, we obtain a variable-dependent weighting that better exploits the auxiliary-site models to improve prediction. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as the model trained on the target data alone, regardless of possible data heterogeneity, including imbalanced and non-IID data distributions across sites and model misspecification. Six real-world data examples show that FTRF reduces prediction error by 2-40% compared with methods that do not use auxiliary information.
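The one-shot scheme described above (auxiliary sites ship fitted forests once; the target augments its own covariates with their predictions and fits a random forest to obtain a variable-dependent combination) can be illustrated with a minimal sketch. This is not the thesis's actual FTRF implementation: it uses scikit-learn random forest regressors as stand-ins, and the function names (train_auxiliary_model, augment_with_auxiliary_predictions, fit_target_ensemble) are illustrative assumptions.

```python
# Minimal sketch of a one-shot federated-transfer idea in the spirit of FTRF.
# Assumes numpy arrays and scikit-learn; names and defaults are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_auxiliary_model(X_aux, y_aux, seed=0):
    """Each auxiliary site fits its own random forest locally and sends
    only the fitted model to the target site (a single communication round)."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_aux, y_aux)
    return model

def augment_with_auxiliary_predictions(X_target, aux_models):
    """At the target site, append each auxiliary model's predictions on the
    target covariates as extra columns of the design matrix."""
    aux_preds = [m.predict(X_target).reshape(-1, 1) for m in aux_models]
    return np.hstack([X_target] + aux_preds)

def fit_target_ensemble(X_target, y_target, aux_models, seed=0):
    """Fit a random forest on the augmented target data, so the combination
    of local features and auxiliary predictions can vary with the covariates."""
    X_aug = augment_with_auxiliary_predictions(X_target, aux_models)
    ensemble = RandomForestRegressor(n_estimators=200, random_state=seed)
    ensemble.fit(X_aug, y_target)
    return ensemble

def predict_target(ensemble, X_new, aux_models):
    """Predict at new target points using the same augmentation."""
    X_aug = augment_with_auxiliary_predictions(X_new, aux_models)
    return ensemble.predict(X_aug)
```

In this sketch, the target-site forest implicitly decides, region by region of the feature space, how heavily to rely on each auxiliary model's prediction, which mirrors the variable-dependent (adaptive) weighting described in the abstract.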
Keywords/Search Tags: Federated Transfer, Model Averaging, Random Forest, Adaptive Weighting, One-shot Communication