
Research On Ensemble Learning With Differential Privacy

Posted on: 2022-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 2518306485985919
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, massive amounts of data are generated every day. This has greatly improved the performance of machine learning algorithms, but it also exposes them to serious privacy threats, and privacy protection for machine learning has become a current research hotspot.

Ensemble learning, an important branch of machine learning, can be seen as the machine learning embodiment of the wisdom of crowds. Its main idea is to train multiple learners, combine them according to some combination strategy, and output a final result, typically by voting. Owing to its high accuracy and stability, ensemble learning is widely used in data mining, medical diagnosis, and other fields. However, at a time when personal privacy has attracted wide attention, directly publishing an ensemble model, or the knowledge extracted by it, can leak users' privacy. How to preserve the performance of an ensemble model without revealing users' privacy is therefore a meaningful research problem.

Differential privacy is a privacy-protection technique supported by rigorous mathematical theory. Its main idea is to perturb the true value by adding random noise; it is simple to implement, and it quantifies the level of privacy protection. Differential privacy has become one of the most popular techniques for protecting privacy in machine learning.

There is already some research on differentially private ensemble learning, but challenges remain. Building on existing studies, this thesis investigates Bagging, Random Forest, and their privacy protection. Under the constraint of achieving privacy protection while maintaining classification accuracy, the work proceeds along three lines. First, improve the performance of the non-differentially-private base classifiers. Second, optimize the privacy budget allocation strategy and raise the utilization of the privacy budget. Third, prune the differentially private ensemble model according to the quality of the differentially private base classifiers, rather than simply combining all of them as in previous research work. Based on these considerations, Chapters 3 and 4 study, respectively, a differentially private incremental Bagging method and a differentially private two-stage Random Forest method. The details are as follows.

An incremental Bagging algorithm based on differential privacy is proposed. To improve the classification accuracy of differentially private Bagging, the method first considers how to increase the diversity of the base classifiers and how to optimize the allocation of the privacy budget. In the training-set generation stage, the Bag of Little Bootstraps and the Jaccard similarity coefficient are used for sampling and preprocessing, respectively, which increases the differences among the training sets. To improve the utilization of the privacy budget, an adaptive privacy budget allocation strategy is designed. Moreover, since the noise introduced by differential privacy may completely swamp the true value, a differentially private base classifier may be too inaccurate to guarantee the performance of the ensemble model. Unlike traditional differentially private ensemble learning, the combination stage of this method selects a subset of the differentially private base classifiers according to a criterion and combines only that subset into the final differentially private ensemble model; this process is called differentially private ensemble pruning.

A two-stage Random Forest algorithm based on differential privacy is proposed. Existing correlation-based random forest algorithms directly use the most relevant attribute as the splitting attribute when building decision trees, and do not select the best one from a set of promising attributes by some criterion. However, the most relevant attribute may differ from the desired one; for example, the information gain criterion prefers attributes with many distinct values. To overcome this problem, the method not only considers the correlation between attributes but also obtains a candidate subset of splitting attributes through a criterion, and then applies a Boolean test function on this candidate subset to divide samples into left and right child nodes. To prevent privacy leakage during decision tree construction, the Exponential Mechanism is used to select the partition plane and the Laplace Mechanism is used to perturb the leaf nodes. Together with the complementarity between base classifiers, this yields a two-stage random forest algorithm based on differential privacy.

Extensive experiments on real data sets evaluate the classification performance of the two algorithms. They are compared with other methods, and the privacy guarantees and experimental results are analyzed theoretically.
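The noise-addition idea behind differential privacy mentioned above can be sketched as follows. This is a generic illustration of the standard Laplace mechanism, not code from the thesis; the function name and parameters are illustrative:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a numeric value with epsilon-differential privacy by
    adding Laplace noise with scale sensitivity / epsilon."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# A smaller epsilon (stronger privacy) means larger noise on average.
noisy_count = laplace_mechanism(true_value=120.0, sensitivity=1.0, epsilon=0.5)
```

Here `sensitivity` is the largest change a single individual's record can cause in the true value; for a counting query it is 1, which is the typical setting when perturbing leaf-node counts.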
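The ensemble pruning step described above, keeping only a useful subset of the noisy base classifiers, can be sketched with a simple accuracy-based selection criterion. The thesis's actual criterion may differ, and all names below are illustrative:

```python
def accuracy(clf, X, y):
    """Fraction of validation samples a classifier predicts correctly.
    Here a classifier is any callable mapping a sample to a label."""
    return sum(clf(x) == t for x, t in zip(X, y)) / len(y)

def prune_ensemble(classifiers, X_val, y_val, keep_ratio=0.5):
    """Rank differentially private base classifiers by validation
    accuracy and keep only the top fraction for the final ensemble."""
    ranked = sorted(classifiers,
                    key=lambda clf: accuracy(clf, X_val, y_val),
                    reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

The point of pruning is that a base classifier whose training signal was swamped by Laplace noise contributes little, so dropping it can raise ensemble accuracy without spending extra privacy budget.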
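The Exponential Mechanism used to select the partition plane can be sketched as follows; the scores would be, for example, the correlation or information gain of each candidate attribute. The max-score shift is a standard trick for numerical stability, and the names are illustrative rather than the thesis's own:

```python
import math
import random

def exponential_mechanism(candidates, scores, epsilon, sensitivity=1.0, rng=None):
    """Select one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity)) -- the standard
    exponential mechanism for differentially private selection."""
    rng = rng if rng is not None else random.Random()
    top = max(scores)  # shift so the largest weight is exp(0) = 1
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    r = rng.random() * sum(weights)
    running = 0.0
    for cand, w in zip(candidates, weights):
        running += w
        if r <= running:
            return cand
    return candidates[-1]
```

With a large privacy budget the highest-scoring attribute is chosen almost surely; as epsilon shrinks, the choice approaches uniform, which is what keeps the split decision from revealing too much about any single record.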
Keywords/Search Tags:Differential privacy, Ensemble learning, Bagging, Random forest, Classification