Research On The Theory And Method Of Differential Privacy Synthetic Data Publication

Posted on: 2019-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Xiao
GTID: 2428330545491407
Subject: Software engineering
Abstract/Summary:
With the development of Internet technology, public and private organizations can collect large amounts of microdata containing personal details, and the demand to release these data to the public and to research institutions is steadily growing. At the same time, data holders face pressure to make their data publicly available in order to demonstrate transparency. However, microdata contain a great deal of sensitive personal information. If they are published without processing, that information is at serious risk of leakage, so the raw data must be processed before release. The differential privacy model is the most widely used privacy-preserving data publishing technique today, and because of its strong privacy guarantees it has been applied in many areas of privacy protection. A differentially private mechanism protects privacy mainly by injecting noise into the original data. For high-dimensional data, however, existing privacy-preserving algorithms inject excessive noise, which results in poor data utility. Guaranteeing the utility of the data while satisfying differential privacy is therefore a challenging problem.

This thesis focuses on synthetic data publishing for high-dimensional data under the differential privacy model. It proposes improvements to existing algorithms that cannot handle high-dimensional data publishing effectively, with the aim of improving the utility of the synthetic data while satisfying differential privacy. The main work of this thesis is as follows:

(1) Bayesian network structure learning is studied. To address the deficiencies of existing algorithms, a hybrid structure learning scheme based on dependency relationships and a scoring function is proposed, so that the low-dimensional marginal distributions of the attributes in the Bayesian network adequately approximate the full joint distribution of the attributes in the high-dimensional data.
(2) For sampling high-dimensional data, a differentially private sampling algorithm based on the Bayesian network is proposed. The low-dimensional marginal distributions of the attributes in the network are used to approximate the full distribution of the high-dimensional data. Laplace noise is injected into these marginal distributions, and a synthetic dataset is then sampled from the noisy distributions. Sampling therefore takes place in a low-dimensional space, which reduces computational complexity.
(3) The utility of the synthetic datasets is verified experimentally. Compared with related algorithms, the methods in this thesis produce synthetic datasets that are more similar to the original datasets while satisfying the differential privacy constraints.
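The sampling idea in contribution (2) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the network structure is taken as given rather than learned, the privacy budget is split evenly across attributes, and neighbouring datasets are assumed to differ by adding or removing one record (so each count histogram has L1 sensitivity 1). The names `laplace_noise`, `noisy_marginal`, and `synthesize` are hypothetical helpers introduced only for this illustration.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(rows, attrs, domains, epsilon, rng):
    """Noisy joint distribution over `attrs` as {value tuple: probability}."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    cells = list(product(*(domains[a] for a in attrs)))
    scale = 1.0 / epsilon  # assumption: L1 sensitivity of the histogram is 1
    # Add Laplace noise to every cell, then clip negatives to zero.
    noisy = {c: max(counts[c] + laplace_noise(scale, rng), 0.0) for c in cells}
    total = sum(noisy.values())
    if total <= 0:  # degenerate case: fall back to a uniform distribution
        return {c: 1.0 / len(cells) for c in cells}
    return {c: v / total for c, v in noisy.items()}


def synthesize(rows, network, domains, epsilon, n_samples, seed=0):
    """Sample synthetic rows from noisy low-dimensional marginals.

    `network` is a list of (attribute, parents_tuple) pairs in topological
    order, i.e. every parent appears before its child.
    """
    rng = random.Random(seed)
    eps_each = epsilon / len(network)  # even split, sequential composition
    marginals = {
        attr: noisy_marginal(rows, parents + (attr,), domains, eps_each, rng)
        for attr, parents in network
    }
    synthetic = []
    for _ in range(n_samples):
        row = {}
        for attr, parents in network:
            key = tuple(row[p] for p in parents)
            # Conditional weights of attr given the already sampled parents.
            weights = [marginals[attr][key + (v,)] for v in domains[attr]]
            if sum(weights) <= 0:
                weights = [1.0] * len(domains[attr])
            row[attr] = rng.choices(domains[attr], weights=weights)[0]
        synthetic.append(row)
    return synthetic
```

For example, with `network = [("a", ()), ("b", ("a",))]` and `domains = {"a": [0, 1], "b": [0, 1]}`, a call such as `synthesize(rows, network, domains, epsilon=1.0, n_samples=100)` returns 100 synthetic rows. Because each noisy table covers only an attribute and its parents, the noise is added to small tables rather than to the full high-dimensional histogram. A seeded `random.Random` is used here for reproducibility only and would not be appropriate for a real privacy deployment.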
Keywords/Search Tags:privacy preserving data publishing, high-dimensional data, differential privacy, Bayesian network, synthetic data