| Data-driven artificial intelligence can extract useful information from massive amounts of data, helping businesses boost productivity, improve customer experience and services, and better serve users. However, data often contain a large amount of sensitive personal information, such as consumption records and medical and health information. As public awareness of personal information protection grows and relevant laws and regulations are introduced, enterprises and organizations must collect, store, and use data within the scope permitted by law and may not share or disclose data arbitrarily. As a result, the data held by different enterprises and organizations resemble isolated islands that cannot be interconnected or put to fuller use. How to break through these "data islands," connect multi-party data sources in compliance with legal requirements, and leverage the value of massive data has become a topic of wide concern in both academia and industry. Ensemble decision tree learning is a commonly used family of machine learning algorithms that is non-parametric, highly interpretable, and adaptable to high-dimensional data, and it is widely used in fields such as finance and healthcare. In these fields, however, data are often strictly confidential, so different financial or medical institutions cannot directly pool the data they hold for joint modeling. Research is therefore needed on training tree-based ensemble models, and predicting with them, while preserving privacy. Depending on the privacy protection constraints, privacy-preserving tree-based ensemble models fall into two scenarios: the federated learning scenario and the provably secure scenario. This paper investigates tree-based ensemble models in both scenarios and, addressing the shortcomings of existing research, achieves the following results. (1) In the vertical federated learning scenario, tree-based ensemble models 
currently suffer from privacy issues caused by exposing gradient aggregation values and the leaf nodes reached by prediction samples. To address these issues, this paper proposes an information gain calculation algorithm based on semi-homomorphic encryption and multiplicative perturbation, and a decision tree prediction algorithm based on leaf node selection vectors, which hide the gradient aggregation values and the leaf nodes of prediction samples during model training and prediction. The proposed scheme’s resistance to the corresponding attacks is also verified. In addition, to improve training efficiency, this paper proposes an optimization of the aggregation value calculation based on merging frequently co-occurring subsets. This technique exploits the common subsets shared by different feature bins to merge some homomorphic ciphertext additions, thereby reducing the model’s running time. Experimental results on multiple real-world datasets show that, compared with existing methods, the proposed federated learning algorithm for tree-based ensemble models significantly improves both privacy strength and training efficiency. (2) In the provably secure scenario, current methods for tree-based ensemble models suffer from a large volume of redundant zeros in the node representation and lack support for supervised encoding of discrete categorical features, resulting in long running times and low efficiency. This paper proposes ENTENTE, a secure multi-party computation algorithm for ensemble tree models. ENTENTE replaces the secret-shared bit vector representation with an anonymous sample ID representation and trains the ensemble tree model using an information gain comparison algorithm that combines multiplication and addition operations. To handle categorical features, a feature pre-processing module based on supervised encoding 
is proposed to replace one-hot encoding of categorical features. Experimental results on multiple real-world datasets show that, compared with existing methods, ENTENTE significantly reduces the running time of tree-based ensemble models in the provably secure scenario. |
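
The idea of merging ciphertext additions across feature bins can be illustrated with a minimal sketch. It is not the paper's algorithm: plain integers stand in for ciphertexts (with an additively homomorphic scheme each `+` would be one ciphertext addition), and the function names, the example bins, and the rule of reusing only the single largest pairwise intersection are all assumptions made for illustration.

```python
from itertools import combinations


def chain_sum(ids, values):
    """Add values[i] for i in ids one by one; return (total, additions used).

    Assumes ids is non-empty.
    """
    ids = sorted(ids)
    total = values[ids[0]]
    for i in ids[1:]:
        total += values[i]  # stands in for one homomorphic ciphertext addition
    return total, len(ids) - 1


def naive_bin_sums(bins, values):
    """Aggregate every bin independently."""
    sums, adds = {}, 0
    for name, ids in bins.items():
        sums[name], used = chain_sum(ids, values)
        adds += used
    return sums, adds


def merged_bin_sums(bins, values):
    """Aggregate bins, reusing the sum of one subset shared by two bins.

    Hypothetical heuristic: pick the largest pairwise intersection.
    """
    shared = set()
    for a, b in combinations(bins.values(), 2):
        common = set(a) & set(b)
        if len(common) > len(shared):
            shared = common

    adds = 0
    shared_sum = None
    if len(shared) >= 2:  # a shared subset of size < 2 saves nothing
        shared_sum, adds = chain_sum(shared, values)

    sums = {}
    for name, ids in bins.items():
        ids = set(ids)
        if shared_sum is not None and shared <= ids:
            total = shared_sum
            for i in sorted(ids - shared):
                total += values[i]
                adds += 1
        else:
            total, used = chain_sum(ids, values)
            adds += used
        sums[name] = total
    return sums, adds


# Illustrative data: 8 samples, two features, two bins per feature.
grads = [3, -1, 4, 1, -5, 9, 2, -6]
bins = {
    "A0": {0, 1, 2, 3}, "A1": {4, 5, 6, 7},  # bins of feature A
    "B0": {0, 1, 2, 4}, "B1": {3, 5, 6, 7},  # bins of feature B
}
naive = naive_bin_sums(bins, grads)
merged = merged_bin_sums(bins, grads)
print(naive[1], merged[1])
```

In this toy case the subset {0, 1, 2} is shared by bins A0 and B0, so its sum is computed once and reused, and both functions return identical per-bin sums. A fuller version would reuse several shared subsets rather than just one.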