Research On Privacy-preserving Methods In Data Sharing

Posted on: 2015-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y B Yuan
GTID: 1318330518472867
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of computer technology, the amount of personal information available in digital form is increasing dramatically. Easy access to information makes daily life more convenient: social networks let us contact friends at any time, and search engines return a wealth of information with a few clicks. However, because much of the valuable data is private, data sharing often leads to privacy disclosure. This long-standing tension between data use and individual privacy has given rise to research on privacy preservation in data sharing, which aims to guarantee the security of released data while sacrificing only an acceptable amount of the utility of the raw data. Techniques from statistics and computer science are therefore used to collect and use data while preventing the leakage of sensitive information. This dissertation studies privacy preservation in data sharing and proposes new algorithms for the problem. Its main contributions are as follows.

Firstly, k-anonymity, a privacy protection model for data dissemination, can prevent linking attacks but cannot prevent homogeneity attacks or background knowledge attacks. Moreover, most such privacy-preserving models rely on generalization and suppression, which lead to excessive information loss. To minimize the information loss incurred during anonymization, we propose an enhanced privacy-preserving paradigm based on p-sensitive k-anonymity. The process is as follows: first, a nearest-neighbor search splits the data set into clusters; then each cluster is published individually under this paradigm. We analyze the correctness and complexity of the algorithm theoretically and verify its validity experimentally. Our preliminary experimental results indicate that the algorithm not only reduces information loss but also yields better utility of the anonymized data.

Secondly, since existing implementations of the l-diversity model tend to have either low efficiency or high information loss, we propose an implementation of the l-diversity model based on an improved clustering algorithm. The algorithm first calculates the variance of each quasi-identifier attribute. It then uses these variances to determine the weight of each attribute in the data similarity calculation. Next, it performs constrained clustering based on the similarity among records. Finally, it generalizes each cluster to satisfy the l-diversity model. The correctness and complexity of the algorithm are also analyzed theoretically. Simulation results show that the proposed algorithm achieves the l-diversity model with less data loss and higher efficiency.

Thirdly, we propose CPPPW, a privacy-preserving pattern classification algorithm for large-scale data sets based on Parzen window kernel density estimation. First, the probability density of the original large-scale training set is estimated. Then l·a replacement training samples are drawn from the estimated probability density function, where l is the number of original samples and a is determined by 10-fold cross-validation. Given enough training samples, kernel density estimation yields a more accurate estimate of the density function and thus preserves the quality of the replacement data set. The privacy-preserving performance of CPPPW and the ASN algorithm is analyzed theoretically, and the analysis shows that CPPPW provides stronger privacy preservation than ASN. Two sets of simulation experiments show that three classic classification algorithms achieve equivalent classification accuracy on the replacement training samples. Because the classification model is trained on the replacement samples, the algorithm avoids leaking the original attributes. Compared with ASN, CPPPW not only preserves privacy but also achieves higher precision and recall and better classification accuracy.

Finally, pattern classification involves learning from the original training samples, which easily leads to privacy disclosure. To avoid privacy leaks during classification without degrading the performance of the algorithm, we propose CPPPCA, a privacy-preserving pattern classification algorithm for sparse data based on principal component analysis (PCA). The algorithm extracts the principal components of the original training data and converts the original training samples into new samples expressed in terms of those components. A classification model is then trained on the new samples. The privacy-preserving performance of CPPPCA and the ASN algorithm is analyzed theoretically, and the analysis shows that CPPPCA provides stronger privacy preservation than ASN. Two sets of simulation experiments show that three classic classification algorithms achieve higher classification accuracy on the replacement training samples. Because the classification model is trained on the replacement samples, the algorithm avoids leaking the original attributes. Compared with the ASN and WT algorithms, CPPPCA not only preserves privacy but also achieves higher precision and recall and better classification accuracy.
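For readers unfamiliar with the p-sensitive k-anonymity condition named in the abstract, the following is a minimal sketch of how a released table could be checked against it. It uses the standard definition (every group of records sharing the same quasi-identifier values has at least k records and at least p distinct sensitive values) and is not the dissertation's clustering algorithm. The function name, column names, and toy table are hypothetical.

```python
from collections import defaultdict


def is_p_sensitive_k_anonymous(records, quasi_ids, sensitive, k, p):
    """Check the p-sensitive k-anonymity condition on a released table.

    Every group of records with identical quasi-identifier values must
    contain at least k records and at least p distinct sensitive values.
    """
    groups = defaultdict(list)
    for row in records:
        key = tuple(row[q] for q in quasi_ids)
        groups[key].append(row[sensitive])
    return all(len(vals) >= k and len(set(vals)) >= p
               for vals in groups.values())


# Hypothetical toy table whose quasi-identifiers are already generalized.
table = [
    {"age": "20-29", "zip": "150**", "disease": "flu"},
    {"age": "20-29", "zip": "150**", "disease": "asthma"},
    {"age": "30-39", "zip": "151**", "disease": "flu"},
    {"age": "30-39", "zip": "151**", "disease": "flu"},
]

print(is_p_sensitive_k_anonymous(table, ["age", "zip"], "disease", k=2, p=1))  # True
print(is_p_sensitive_k_anonymous(table, ["age", "zip"], "disease", k=2, p=2))  # False
```

The second call fails because the 30-39 group is 2-anonymous but homogeneous in its sensitive value, which is exactly the homogeneity attack that plain k-anonymity does not prevent.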
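The variance-based attribute weighting described for the l-diversity clustering can be illustrated with a small sketch. The abstract does not give the weighting formula, so this example assumes weights proportional to each attribute's variance (normalized to sum to 1) and a weighted Euclidean distance. Both choices, and all names below, are assumptions for illustration. Only numeric quasi-identifiers are handled.

```python
from math import sqrt
from statistics import pvariance


def variance_weights(rows):
    """Return one weight per column, proportional to its population variance.

    Assumption: the abstract does not state the formula; proportional
    weighting is used here purely as an illustration.
    """
    variances = [pvariance(col) for col in zip(*rows)]
    total = sum(variances)
    if total == 0:
        # All attributes are constant: fall back to equal weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two numeric records."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical records: (age, constant attribute).
rows = [(25, 1.0), (35, 1.0), (45, 1.0)]
weights = variance_weights(rows)
distance = weighted_distance(rows[0], rows[1], weights)
print(weights, distance)
```

Under this assumption an attribute that never varies receives zero weight, so it does not influence which records are clustered together.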
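The replacement-sample step of CPPPW can be sketched as drawing points from a Parzen window density estimate. With a Gaussian kernel, a draw from the estimate is obtained by picking an original sample at random and adding Gaussian noise with standard deviation equal to the bandwidth. The kernel choice, the fixed bandwidth, and every name below are assumptions. The abstract specifies only that l·a samples are generated and that a is chosen by 10-fold cross-validation, which is not reproduced here.

```python
import random


def sample_parzen(samples, a, bandwidth, rng):
    """Draw round(a * len(samples)) points from a Gaussian Parzen estimate.

    Each draw picks one original sample uniformly at random and perturbs
    every coordinate with Gaussian noise of standard deviation `bandwidth`.
    """
    n_new = round(a * len(samples))
    replacement = []
    for _ in range(n_new):
        center = rng.choice(samples)
        replacement.append(tuple(rng.gauss(x, bandwidth) for x in center))
    return replacement


# Hypothetical feature vectors belonging to one class.
original = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.1), (1.1, 2.2)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replacement = sample_parzen(original, a=2.5, bandwidth=0.1, rng=rng)
print(len(replacement))  # 10
```

For a classification task the sampling would be run separately per class so that each replacement sample keeps a label. This sketch illustrates the sampling step only and makes no claim about the privacy guarantees analyzed in the dissertation.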
Keywords/Search Tags: Data sharing, Privacy preserving, Anonymization, Kernel density estimation, Principal component analysis