
Research On Distributed Data Mining Methods Based On Differential Privacy

Posted on: 2023-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhong
GTID: 2568307061953959
Subject: Computer technology
Abstract/Summary:
With the development of data collection and sharing technology, collecting business data distributed across different terminals for analysis and modeling has become an important form of big data mining. However, these terminals may belong to different institutions that do not trust each other. With increasing attention to data privacy, how to mine the knowledge contained in the global data set while protecting the data privacy of each terminal has become an urgent problem. To address it, this thesis focuses on frequent itemset mining and decision tree classification in distributed scenarios, studies distributed privacy-preserving schemes based on differential privacy, and achieves effective extraction of global association patterns and classification patterns while taking into account the data privacy of all parties. The main work of the thesis is as follows:

(1) For the privacy-preserving top-k frequent itemset mining problem in the distributed scenario, a method based on differential privacy, DP-DFIM, is designed. It mines frequent itemsets by having a central node aggregate the noisy support counts of the itemsets reported by all parties. To maintain the utility of the support counts, a post-processing scheme based on the order constraint of the support counts is designed, which improves their accuracy. To further reduce the influence of noise, the noisy support counts are corrected based on the similarity between the global support distribution and the central node’s support distribution, which improves the quality of mining.

(2) For the privacy-preserving decision tree classification problem in the distributed scenario, a decision tree construction method that satisfies differential privacy, DP-DDTC, is proposed. All parties send the noisy results of their count queries to the server, which aggregates them to determine the optimal splitting attribute. To ensure the utility of the count query results, an optimization scheme is designed based on the constraints that the query values satisfy, which improves their accuracy. To address excessive noise masking the true values, a targeted privacy budget allocation scheme is designed to control the signal-to-noise ratio. To further reduce the influence of noise, a metric is designed to measure the importance of attributes, so that useless attributes can be filtered out, reducing the amount of injected noise and improving mining accuracy.

Experimental results on real data sets show that the methods proposed in this thesis can ensure the utility of the mining results while satisfying differential privacy.
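
The aggregation step described in (1) can be illustrated with a generic Laplace-mechanism sketch. This is not the DP-DFIM algorithm itself, whose details the abstract does not give: the function names, the fixed candidate list, and the conservative sensitivity bound (one transaction can change each candidate's count by at most 1) are all assumptions made for illustration, and the order-constraint post-processing and distribution-based correction are omitted.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_local_counts(transactions, candidates, epsilon, rng):
    """Run by each party: support counts of the candidates plus Laplace noise.

    Assumption: adding or removing one transaction changes each candidate's
    count by at most 1, so the L1 sensitivity is at most len(candidates).
    """
    scale = len(candidates) / epsilon
    report = {}
    for itemset in candidates:
        support = sum(1 for t in transactions if itemset <= t)
        report[itemset] = support + laplace_noise(scale, rng)
    return report


def aggregate_top_k(party_reports, k):
    """Run by the central node: sum the noisy counts and keep the k largest."""
    totals = {}
    for report in party_reports:
        for itemset, noisy_count in report.items():
            totals[itemset] = totals.get(itemset, 0.0) + noisy_count
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [itemset for itemset, _ in ranked[:k]]
```

Each party would call `noisy_local_counts` on its own transactions (sets of items) and send only the noisy report to the central node, which calls `aggregate_top_k`. A smaller `epsilon` gives stronger privacy but noisier counts, which is the utility loss that the thesis's post-processing steps aim to reduce.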
Keywords/Search Tags: Differential Privacy, Distribution, Frequent Itemset Mining, Decision Tree