Learning from perturbed data for privacy-preserving data mining

Posted on:2007-03-22

Degree:Ph.D

Type:Dissertation

University:Washington State University

Candidate:Ma, Jianjie

GTID:1448390005971332

Subject:Statistics

Abstract/Summary:

In this dissertation, we concentrate on privacy-preserving data mining (PPDM) from distributed data using post-randomization (PRAM) techniques. PRAM provides a general framework for randomizing categorical data. We estimate frequency counts from the randomized data using either the method of moments or maximum likelihood estimation, and we give a normal approximation for the distribution of the estimators. The variance of each frequency-count estimator is, in order of magnitude, inversely proportional to the sample size.

The privacy preserved by PRAM is quantified by gamma-amplification and probabilistic K-anonymity. Randomization causes some information loss, which can be quantified by metrics such as the distance between two distributions. Another important aspect of information loss, independence loss, is also discussed in this dissertation.

The proposed method is applied to Bayesian network learning. We consider both structure and parameter learning. For structure learning, we face the familiar extra-link problem, since estimation errors tend to break the conditional independence among the variables. To solve this problem, we propose modifications to the score functions used for Bayesian network learning.

For continuous-valued data, we propose MGAS (Modified Agglomerative Scheme), a discretization technique based on hierarchical clustering. MGAS discretizes numerical variables and indirectly enhances privacy. The technique has been applied to learning linear classifiers from randomized data. A linear classifier is a model based on cost optimization; instead of the original cost function, the expectation of the cost based on the randomized data is optimized.

Finally, the proposed technique is applied to both association rule mining and decision tree learning. The supports of the K-itemsets in association rule mining can be estimated from the randomized data. Randomizing several items simultaneously reduces the simultaneous gamma-amplification, which further limits simultaneous privacy breaches. Privacy-preserving decision tree learning is accomplished by estimating, from the randomized data, the frequency counts needed to calculate information gain.

Our experiments show that post-randomization is an efficient, flexible, and easy-to-use method for privacy-preserving data mining. Experimental results with different levels of randomization and different sample sizes show that the method produces accurate models even at high levels of randomization.
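
As an illustration of the frequency-count estimation described in the abstract, the following is a minimal sketch, not the dissertation's implementation. It assumes one particular PRAM scheme chosen here for simplicity: each value is kept with probability `retain_prob` and otherwise replaced by a category drawn uniformly at random. Under that assumption the expected observed count of category c is `retain_prob * n_c + (1 - retain_prob) * n / K`, and the moment estimator solves this equation for `n_c`. The function names, the parameter `retain_prob`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def pram_randomize(values, categories, retain_prob, rng):
    """Assumed PRAM scheme: keep each value with probability retain_prob,
    otherwise replace it with a category drawn uniformly at random."""
    randomized = []
    for value in values:
        if rng.random() < retain_prob:
            randomized.append(value)
        else:
            randomized.append(rng.choice(categories))
    return randomized


def estimate_counts(randomized, categories, retain_prob):
    """Moment estimator of the original frequency counts.

    Solves E[observed_c] = retain_prob * n_c + (1 - retain_prob) * n / K
    for n_c, with the observed count standing in for its expectation.
    """
    n = len(randomized)
    k = len(categories)
    observed = Counter(randomized)
    noise_per_category = (1.0 - retain_prob) * n / k
    return {
        c: (observed[c] - noise_per_category) / retain_prob
        for c in categories
    }


# Toy data (hypothetical): 10,000 records over three categories.
rng = random.Random(0)
categories = ["a", "b", "c"]
original = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000

randomized = pram_randomize(original, categories, retain_prob=0.5, rng=rng)
estimates = estimate_counts(randomized, categories, retain_prob=0.5)
print(estimates)
```

The estimates always sum to the sample size, but an individual estimate can be negative when a category is rare and the randomization level is high. A general PRAM transition matrix would require solving a linear system rather than using this closed form.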

Keywords/Search Tags:

Data, Randomization, Mining, PRAM, Method, Technique, Using

Related items

1	Privacy and utility analysis of the randomization approach in Privacy-Preserving Data Publishing
2	The Design And Implementation Of A Function Level Randomization Defensing Method Against ROP Attack
3	The Development Of Central Randomization Network System For Multi-Center Randomized Controlled Trials
4	Research On The Technology Of Data Structure Randomization
5	The Research Of Defending ROP Attacks Using Basic Block Level Randomization
6	Fuzzy Data Mining Technique In The Application Of The Atmospheric System
7	Research On Visualization Model And Its Applications In Data Mining
8	Research And Improvement Of Runtime Randomization Defense Method Against Memory Information Leakage
9	Research On Privacy Preserving Data Mining
10	Practical and theoretical issues in randomization