Font Size: a A A

Research On Differentially Private Multi-party Data Publishing

Posted on:2020-11-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:P TangFull Text:PDF
GTID:1368330572973541Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of network technology and the popularization of intelligent devices,people's ability to generate and collect data is increasing.In reality,a large amount of relevant data are usually held by different owners(i.e.,parties).Collaborative publishing multi-party data is helpful for data analysts to fully extract the value of data and provide better data services.However,the data often contains a large amount of sensitive personal information of users,and direct releasing data without reasonable privacy protection will inevitably lead to serious privacy disclosure.Therefore,this thesis studies the problem of differentially private multi-party data publishing.Compared with the differentially private single-party data publishing,the requirements of personal privacy protection in differentially private multi-party data publishing are much severe.In addition,it should also be considered to improve the utility of entire synthetic dataset and reduce the communication cost between parties in the process of data publishing.Thus,three typical multi-party data publishing scenarios(i.e.,horizontally partitioned relational data,vertically partitioned relational data and multi-party sequential data)are considered in this thesis.The problems of differentially private multi-party data publishing are intensively studied,and the following innovative results are achieved.(1)To solve the problem of differentially private horizontally partitioned relational data publishing,a differentially private sequential update of Bayesian network(DP-SUBN)solution is proposed.In DP-SUBN,the parties and the curator collaboratively identify the Bayesian network that best fits the integrated dataset in a sequential manner,from which a synthetic dataset can then be generated.The fundamental advantage of adopting the sequential update manner is that the parties can treat the statistical results provided by previous parties as their prior knowledge to direct how to learn Bayesian network,which can improve the accuracy of learning results and reduce the complexity and sensitivity of learning algorithm.The core of DP-SUBN is the construction of the search frontier.To improve the fitness of the Bayesian network and reduce the communication cost,this thesis introduces a correlation-aware sear-ch frontier construction(CSFC)method.To privately quantify the correlations of attribute pairs without introducing too much noise,this thesis first proposes a non-overlapping covering design(NOCD)method,and then introduces a dynamic programming method to find the optimal parameters used in NOCD to ensure that the injected noise is minimum.Through formal privacy analysis,it shows that DP-SUBN satisfies ?-differential privacy.Extensive experiments on a real dataset demonstrate that DP-SUBN offers desirable data utility with low communication cost.(2)To solve the problem of differentially private vertically partitioned relational data publishing,a differentially private latent tree(DPLT)solution is proposed.In DPLT,the parties and the curator collaboratively identify the latent tree that best approximates the joint distribution of the integrated dataset,from which a synthetic dataset can be generated.The fundamental advantage of adopting latent tree model(LTM)is that,the connections between a small number of latent attributes that derived from each local dataset can be used to capture the cross-dataset dependencies of the observed attributes,such that the joint distribution of the integrated dataset can be learned with little injected noise and low computation and communication costs.To improve the accuracy of the generated latent attributes,a two-phase latent attribute generation(TLAG)is proposed.To quantify the correlations of latent attribute(LA)pairs with a small amount of injected noise,a tree index based correlation quantification(TICQ)is presented.To achieve differential privacy in an efficient manner,a distributed Laplace perturbation protocol(DLPP)is introduced.Through formal privacy analysis,it shows that DPLT satisfies ?-differential privacy.Extensive experiments on real datasets demonstrate that DPLT offers desirable data utility with low computation and communication costs.(3)To solve the problem of differentially private multi-party sequential data publishing,a differentially private distributed prediction suffix tree(DPST)solution is proposed.In DPST,the parties and the curator collaboratively identify the prediction suffix tree(PST)by starting from a root node,and then recursively splitting each node according to the joint judgment of whether its score is larger than a given threshold.The PST is used to approximate the statistical characteristics of the integrated dataset.From the PST,a synthetic dataset can be generated.To guarantee differential privacy,the value of the score of each node should not be disclosed.To solve this problem,based on homomorphic encryption technology and Yao's comparison protocol,a basic node splitting judgment protocol is proposed.In order to further reduce the protocol communication cost,an improved node splitting judgment protocol is proposed.Through formal privacy analysis,it shows that DPST satisfies ?-differential privacy.Extensive experiments on real datasets demonstrate that DPST offers desirable data utility with low computation and communication costs.
Keywords/Search Tags:differential privacy, data publishing, horizontal partitioning, vertical partitioning, sequential data
PDF Full Text Request
Related items