
Privacy and utility analysis of the randomization approach in Privacy-Preserving Data Publishing

Posted on: 2009-03-20
Degree: Ph.D.
Type: Thesis
University: Syracuse University
Candidate: Huang, Zhengli
Full Text: PDF
GTID: 2448390002999462
Subject: Computer Science
Abstract/Summary:
Randomization has emerged as an important approach to data disguising in Privacy-Preserving Data Publishing (PPDP). Depending on the type of data it is applied to, the randomization approach falls into two classes: Random Perturbation (RP) for continuous data and Randomized Response (RR) for categorical data. In PPDP, utility is an important metric that refers to how well data mining information is preserved, while privacy, the more important metric, refers to how well the original information is protected. Privacy can be affected by several factors, such as attribute correlations and randomization parameters. However, for attribute correlations, no one has studied whether they affect privacy and, if so, how they affect the privacy-preserving property of randomization; for randomization parameters, no one has investigated how to systematically compare different parameter choices, or which parameters are optimal in the sense that the disguised data are as private as possible while remaining useful for data mining computations.
This thesis addresses these problems. First, we identify correlations among attributes as a key factor affecting privacy. We propose two data reconstruction methods that are based on correlations among continuous attributes. We analyze the relationship between data correlations and the amount of private information that can be disclosed by our proposed data reconstruction schemes. Our studies show that when the correlations are high, the original data can be reconstructed more accurately, i.e., more private information can be disclosed. To improve privacy, we propose a modified randomization scheme that takes the attribute correlations into account. Our experimental results show that, when the improved randomization method is used, both reconstruction methods become less accurate, i.e., less private information is disclosed.
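As a rough illustration of why attribute correlations matter for RP, the toy sketch below disguises two perfectly correlated attributes with independent Gaussian noise and then averages the two disguised copies. It is not one of the thesis's reconstruction methods; the function names, the noise level, and the data range are assumptions chosen only for this example.

```python
import random


def perturb(values, sigma, rng):
    """Additive random perturbation: publish x + r, with r drawn from N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in values]


def rmse(estimates, truth):
    """Root-mean-square error between estimates and the original values."""
    return (sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)) ** 0.5


rng = random.Random(0)

# Hypothetical original data; attribute B is a copy of attribute A,
# i.e. the two attributes are perfectly correlated (an extreme toy case).
original = [rng.uniform(20.0, 60.0) for _ in range(5000)]
attr_a = original
attr_b = list(original)

# Each attribute is disguised independently with the same noise level.
sigma = 10.0
noisy_a = perturb(attr_a, sigma, rng)
noisy_b = perturb(attr_b, sigma, rng)

# Using a single disguised attribute as the estimate of the original.
naive_error = rmse(noisy_a, original)

# Exploiting the correlation: averaging the two disguised attributes
# cancels part of the independent noise.
averaged = [(a + b) / 2 for a, b in zip(noisy_a, noisy_b)]
correlated_error = rmse(averaged, original)

print(f"error without using correlation: {naive_error:.2f}")
print(f"error when exploiting correlation: {correlated_error:.2f}")
```

In this toy setting the averaged estimate has a noticeably lower error than a single disguised attribute, which matches the qualitative claim that higher correlation allows more accurate reconstruction.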
Second, for RR, we formulate the quantification of privacy and utility as estimation problems. Using these quantifications to compare different RR schemes, we employ an evolutionary multi-objective optimization method to find optimal randomization parameters for RR. The experimental results show that our scheme performs much better than the existing RR schemes. Third, for RP, we first formulate an RP technique that is more general than the existing RP technique. After generalizing the RP technique, we discretize the data range and use a matrix to hold the randomization parameters. We also formulate the quantification of privacy and utility for the generalized RP technique as estimation problems. Because measuring utility is expensive, we propose an efficient approach to approximate it. Using the privacy and approximate utility metrics, we apply an evolutionary multi-objective optimization method to find optimal randomization parameters for RP. We show that our parameter-selection scheme outperforms the existing scheme.
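For readers unfamiliar with RR, the sketch below shows the classic binary scheme usually attributed to Warner, in which each respondent reports the true value with probability p and the flipped value otherwise. It is a minimal illustration of how a randomization parameter trades privacy against utility, not the optimized scheme proposed in the thesis; the function names and the values of p and the true proportion are assumptions.

```python
import random


def randomize(answers, p, rng):
    """Report each binary answer truthfully with probability p, flipped otherwise."""
    return [a if rng.random() < p else 1 - a for a in answers]


def estimate_proportion(reported, p):
    """Estimate the true proportion of 1s from the disguised answers.

    The observed proportion satisfies lam = p * pi + (1 - p) * (1 - pi),
    so pi = (lam - (1 - p)) / (2 * p - 1) whenever p != 0.5.
    """
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 destroys all information; pi cannot be estimated")
    lam = sum(reported) / len(reported)
    return (lam - (1 - p)) / (2 * p - 1)


rng = random.Random(42)
true_pi = 0.3   # hypothetical true proportion of 1s
p = 0.8         # hypothetical randomization parameter

answers = [1 if rng.random() < true_pi else 0 for _ in range(20000)]
reported = randomize(answers, p, rng)
estimate = estimate_proportion(reported, p)

print(f"estimated proportion: {estimate:.3f}")
```

Moving p toward 0.5 makes individual reports less informative (more privacy) but increases the variance of the estimate (less utility), which is the kind of trade-off a multi-objective search over randomization parameters has to balance.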
Keywords/Search Tags:Randomization, Data, Privacy, Approach, RP technique, Utility, Attribute correlations, Scheme