Learning from perturbed data for privacy-preserving data mining

Posted on: 2007-03-22
Degree: Ph.D
Type: Dissertation
University: Washington State University
Candidate: Ma, Jianjie
Full Text: PDF
GTID: 1448390005971332
Subject: Statistics
Abstract/Summary:
In this dissertation, we concentrate on privacy-preserving data mining (PPDM) from distributed data using post-randomization (PRAM) techniques. PRAM provides a general framework for randomizing categorical data. We estimate frequency counts from the randomized data using either the method of moments or maximum likelihood estimation. A normal approximation to the distribution of each estimator is also given. The variance of each frequency-count estimator is, in order of magnitude, inversely proportional to the sample size.

The privacy preserved by PRAM is quantified by gamma-amplification and probabilistic K-anonymity. Randomization causes some information loss, which can be quantified by metrics such as the distance between two distributions. Another important aspect of information loss is independence loss, which is also discussed in this dissertation.

The proposed method is applied to Bayesian network learning. We consider both structure and parameter learning. In structure learning, we face the familiar extra-link problem, since estimation errors tend to break the conditional independence among the variables. To solve this problem, we propose modifications to the score functions used for Bayesian network learning.

For continuous-valued data, we propose MGAS (Modified Agglomerative Scheme), a discretization technique based on hierarchical clustering. MGAS discretizes numerical variables and indirectly enhances privacy. This technique has been applied to learning linear classifiers from data randomized for privacy. A linear classifier is a model based on cost optimization. Instead of the original cost function, we optimize the expectation of the cost based on the randomized data.

Finally, the proposed technique is applied to both association rule mining and decision tree learning. The supports of the K-itemsets in association rule mining can be estimated from the randomized data. Randomizing several items simultaneously helps limit simultaneous privacy breaches by reducing simultaneous gamma-amplification. Privacy-preserving decision tree learning is accomplished by estimating, from the randomized data, the frequency counts needed to calculate information gain.

Our experiments show that post-randomization is an efficient, flexible and easy-to-use method for privacy-preserving data mining. Experimental results with different levels of randomization and different sample sizes show that this method produces an accurate model, even with a large level of randomization.
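
The idea of randomizing categorical data and then recovering frequency counts by moment estimation can be illustrated with a minimal sketch. It is not the dissertation's implementation: the particular randomization scheme (keep a value with probability `retain_prob`, otherwise replace it with a uniformly drawn category), the function names, and the toy data are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def pram_randomize(values, categories, retain_prob, rng):
    """Randomize categorical values (assumed scheme, for illustration only).

    Each value is kept with probability retain_prob; otherwise it is replaced
    by a category drawn uniformly at random (possibly the same one).
    """
    randomized = []
    for value in values:
        if rng.random() < retain_prob:
            randomized.append(value)
        else:
            randomized.append(rng.choice(categories))
    return randomized


def estimate_counts(randomized, categories, retain_prob):
    """Moment estimate of the original counts from the randomized data.

    Under the scheme above, the expected observed count of category j is
        retain_prob * N_j + (1 - retain_prob) * n / K,
    where N_j is the original count, n the sample size and K the number of
    categories. Solving for N_j gives the estimator below.
    """
    n = len(randomized)
    k = len(categories)
    observed = Counter(randomized)
    uniform_part = (1 - retain_prob) * n / k
    return {c: (observed[c] - uniform_part) / retain_prob for c in categories}


# Hypothetical toy data: 10,000 records over three categories.
rng = random.Random(0)
categories = ["a", "b", "c"]
original = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000

randomized = pram_randomize(original, categories, 0.6, rng)
estimates = estimate_counts(randomized, categories, 0.6)
```

In this sketch the estimated counts always sum to the sample size, and each estimate fluctuates around the true count; a smaller `retain_prob` means more randomization and a larger estimation variance.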
Keywords/Search Tags:Data, Randomization, Mining, PRAM, Method, Technique, Using