Font Size: a A A

Differentially Private Decision Tree Based On Pearson's Correlation Coefficient

Posted on:2022-05-19Degree:MasterType:Thesis
Country:ChinaCandidate:Q YanFull Text:PDF
GTID:2518306485986009Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Decision tree based on Pearson's Correlation Coefficient(PCCDT)is a greedy algorithm in the field of decision trees.It is widely used in the field of pattern recognition and information retrieval.It not only can help medical institutions diagnose patients more accurately,but also companies or individuals make better decisions.However,when the dataset contains sensitive individual information(such as patient's diagnosis information,customer's shopping information,etc.)and is attacked by malicious users with certain background knowledge during the use of the decision tree,the privacy and security of individual will be threatened.How to protect sensitive information during the use of decision tree while ensuring the effectiveness of decision trees is a challenging research problem that needs to be solved.Therefore,this paper proposes a differentially private decision tree based on Pearson's Correlation Coefficient(Diff-PCCDT)for the PCCDT algorithm.The algorithm mainly consists of two parts: one part is select the best splitting attribute using exponential mechanism and another part is disturb class count in leaf nodes.Experiments have proved that by combining the allocation of privacy budget with the maximum depth of the tree,the operating efficiency and prediction accuracy of the algorithm can be improved to a certain extent.In addition,since PCCDT would encounter performance bottlenecks in big data scenarios,it still needs to improve its performance when processing large-scale data.For the Diff-PCCDT algorithm,when the amount of data is too large,its performance and accuracy will be decrease due to excessive noise.Therefore,in response to the above problems,this paper combines differential privacy with the parallel decision tree based on Pearson's Correlation Coefficient(MRPCCDT)under the Map Reduce framework to solve these problems.This paper names this method Differentially Private parallel Decision Tree based on Pearson's Correlation Coefficient algorithm(Diff MR-PCCDT).By dividing a large dataset into several disjoint small datasets,and then handling each small dataset separately,it can effectively improve the operating efficiency of the algorithm while reducing the loss of accuracy.This paper focuses on the privacy problems of the decision tree based on Pearson's Correlation Coefficient.Through a detailed analysis of the privacy leakage problem during the running of the algorithm,the corresponding solutions are proposed.The main research work is as follows:(1)Propose a differential privacy-preserving decision tree based on Pearson's Correlation Coefficient method Diff-PCCDT.For the first time,we analyzed the privacy problems in decision trees based on Pearson's Correlation Coefficient.This algorithm explained and analyzed how an attacker with strong background knowledge used the obtained Pearson's Correlation Coefficient and the class count in leaf nodes to infer the sensitive information of the individual in the dataset during the construction of the decision tree.It pointed out that the privacy leakage problem of the decision tree based on the Pearson's Correlation Coefficient is objective and serious.Regarding the privacy leakage problem of the algorithm,a privacy protection scheme was proposed.By taking the Pearson's Correlation Coefficient between conditional attribute and decision attribute of each intermediate node as the quality function,the exponential mechanism is applied to select the best splitting attribute,and then noise is added to the real class count of each leaf node,which protects the privacy in the construction of the decision tree.The experiment was carried out on 6small and middle-scale datasets(Adult,Mushroom,Sonar,etc.),and compared with the classic differentially private decision tree algorithm Diff P-ID3.After adjusting the two parameters of the maximum depth of the decision tree and the privacy budget,we evaluate the proposed DiffPCCDT algorithm in test accuracy.Results show that the Diff-PCCDT algorithm has more advantages in prediction accuracy.(2)Propose a differential privacy-preserving parallel decision tree based on Pearson's Correlation Coefficient method Diff MR-PCCDT in the scenario of big data.By combining the Diff-PCCDT algorithm with parallel decision tree based on Pearson's Correlation Coefficient under the Map Reduce framework-MR-PCCDT,the privacy protection problem under the parallel computing framework is solved.Since the parallel decision tree based on Pearson's Correlation Coefficient under the Map Reduce framework is to divide a large dataset into several disjoint subsets,and then handling the subsets respectively to construct the decision tree recursively.Therefore,the allocation of privacy budget can apply the parallel composition properties of differential privacy.On this basis,applying the Diff-PCCDT algorithm can reduce the cumulative amount of noise under large datasets and further improve the utility of the algorithm under large datasets.Finally,for the proposed Diff MR-PCCDT algorithm,a rigorous mathematical formula was used to prove that this method satisfies differential privacy.In the case of parallel computing,we conducted experiments on 9 middle and large-scale datasets.Due to the limits of direct methods for comparison,this paper directly compares Diff MR-PCCDT with Diff-PCCDT.Results show that although the algorithm is not much different from the Diff-PCCDT algorithm in test accuracy,it significantly improves the efficiency of the algorithm while ensuring privacy.
Keywords/Search Tags:Differential privacy, Decision tree, Pearson's Correlation Coefficient, parallel computing
PDF Full Text Request
Related items