
Research On Random Forest Algorithm Based On Differential Privacy

Posted on: 2020-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: J M Li
Full Text: PDF
GTID: 2438330626453284
Subject: Software engineering
Abstract/Summary:
The development of Internet and communication technologies in recent years has greatly promoted the maturity of big data and data mining technologies. As a classification algorithm commonly used in data mining, the random forest algorithm is widely used by research institutions, commercial organizations, medical institutions and others to support data mining and analysis. However, improper use of data during mining and analysis can lead to privacy disclosure. This risk has made many organizations and individuals reluctant to provide more information, and data sharing has been greatly restricted, which has hindered the development of data mining technology. Designing a privacy protection strategy for the classification algorithm, and protecting the data to be classified, has become an urgent problem in current data mining technology.

Compared to traditional privacy protection technologies, differential privacy provides a more rigorous definition of privacy protection. It assumes an extremely strict attack model in which the attacker has maximum background knowledge. A differential privacy mechanism implements privacy protection by adding noise to the raw data or to statistics of the data set. This advantage has attracted great attention from researchers at home and abroad. In this thesis, differential privacy is applied to the random forest classification algorithm, and a random forest algorithm based on differential privacy is proposed to protect private information during data classification. The research work of the thesis can be summarized as follows:

(1) Differential privacy implements privacy protection by adding perturbation noise, which lowers the classification accuracy of the random forest algorithm. To reduce this impact, a hybrid decision tree algorithm is proposed. For the construction of a single decision tree in the random forest, the information gain of the ID3 algorithm and the information gain ratio of the C4.5 algorithm are combined into a new attribute metric, IG_GR, which improves the classification accuracy of a single decision tree.

(2) A new privacy budget allocation strategy is proposed for the random forest algorithm. For nodes at different depths in the decision tree, the privacy budget is allocated by weight to each node's counting query and attribute query, which balances the signal-to-noise ratio that differential privacy yields at different depths of the tree. At the same time, the hybrid decision tree algorithm is applied to the construction of the random forest, which balances the privacy and the classification accuracy of the random forest algorithm based on differential privacy.

(3) The hybrid decision tree algorithm and the random forest algorithm based on differential privacy were tested on UCI's Adult and Mushroom data sets. The results show that the hybrid decision tree algorithm proposed in this thesis has better classification accuracy than the existing decision tree algorithm, and that the random forest algorithm based on differential privacy can provide effective privacy protection while maintaining high classification accuracy. The work achieves a balance between privacy and classification accuracy and has practical application value.
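The building blocks named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not give the IG_GR formula or the depth weights, so the sketch only computes the two standard ingredients (information gain and gain ratio) and uses hypothetical weights in `split_budget`. The noisy count uses the standard Laplace mechanism with sensitivity 1, and all function names are chosen here for illustration.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity assumed to be 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def entropy(items):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def info_gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for one categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(values)  # entropy of the attribute's own values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def split_budget(total_epsilon, depth_weights):
    """Divide one tree's budget across depths by weight.

    The weights are hypothetical; the abstract does not state them.
    """
    total = sum(depth_weights)
    return [total_epsilon * w / total for w in depth_weights]


if __name__ == "__main__":
    rng = random.Random(0)
    attribute = ["a", "a", "b", "b"]
    labels = [0, 0, 1, 1]
    print(info_gain_and_ratio(attribute, labels))
    print(split_budget(1.0, [1, 2, 1]))
    print(noisy_count(100, epsilon=0.5, rng=rng))
```

A smaller per-node epsilon means a larger noise scale, which is why the way the budget is divided across depths affects how much signal survives in the noisy counts near the leaves.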
Keywords/Search Tags:differential privacy, decision tree, random forest, privacy budget