
Key Technology Of Privacy Preserving Data Publishing Based On Cluster

Posted on: 2013-08-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G M Yang
Full Text: PDF
GTID: 1228330377459386
Subject: Computer application technology
Abstract/Summary:
The collection of digital information by governments, corporations, and individuals has created tremendous opportunities for knowledge-based decision making. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Current practice relies primarily on policies and guidelines that restrict the types of publishable data and on agreements on the use and storage of sensitive data. The limitation of this approach is that it either distorts data excessively or requires a trust level that is impractically high in many data-sharing scenarios. For example, contracts and agreements cannot guarantee that sensitive data will not be carelessly misplaced and end up in the wrong hands.

A task of the utmost importance is to develop methods and tools for publishing data in a more hostile environment, so that the published data remains practically useful while individual privacy is preserved. This undertaking is called privacy-preserving data publishing (PPDP). In the past few years, research communities have responded to this challenge and proposed many approaches. While the research field is still rapidly developing, it is a good time to discuss the assumptions and desirable properties for PPDP, clarify the differences and requirements that distinguish PPDP from other related problems, and systematically summarize and evaluate different approaches to PPDP. This survey aims to achieve these goals.

First, we propose a k-anonymity algorithm with outlier detection based on density clustering. Generalization-based methods for achieving k-anonymity must exhaustively search the solution space to find the optimal solution, which makes the problem NP-hard, and the anonymized data loses so much information that it is no longer useful. The proposed algorithm is mainly intended to address these two problems.
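To make the k-anonymity requirement and the notion of information loss concrete, here is a minimal Python sketch. The table, the attribute names (`age`, `zip`, `disease`), and the range-based loss measure are illustrative assumptions and are not the metrics defined in the dissertation.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

def range_loss(cluster, attribute, domain_min, domain_max):
    """Loss from generalizing a numeric attribute to the cluster's [min, max] range,
    normalized by the domain width (0 = no loss, 1 = whole domain)."""
    values = [r[attribute] for r in cluster]
    return (max(values) - min(values)) / (domain_max - domain_min)

# Hypothetical generalized table: age published as an interval, zip as a prefix.
published = [
    {"age": "20-29", "zip": "150**", "disease": "flu"},
    {"age": "20-29", "zip": "150**", "disease": "cancer"},
    {"age": "30-39", "zip": "151**", "disease": "flu"},
    {"age": "30-39", "zip": "151**", "disease": "asthma"},
]
two_anonymous = is_k_anonymous(published, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(published, ["age", "zip"], k=3)

# Hypothetical raw cluster before generalization, with an assumed age domain of [20, 40].
raw_cluster = [{"age": 22}, {"age": 27}]
loss = range_loss(raw_cluster, "age", domain_min=20, domain_max=40)
```

In this toy table each quasi-identifier group holds two records, so the table is 2-anonymous but not 3-anonymous. Wider generalized ranges would raise `loss`, which is the trade-off the clustering step tries to control.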
Existing clustering-based data publishing algorithms for k-anonymity fix the cluster size at k and consider neither the distribution of the data set nor whether it contains outliers. A uniformly distributed real data set is only an ideal case, and outliers are common in practice. During clustering-based anonymization, our algorithm takes full account of both information loss and the handling of outliers, and it minimizes the information loss of the anonymized data so that publishers can better balance privacy protection against data utility. The process is as follows. First, we use density clustering to split the data set into clusters while excluding the interference of outliers. Then, we adjust the cluster sizes using information loss metrics and again exclude any outliers that were missed when the data set was first divided. The ultimate aim is to minimize information loss, maximize data utility, and strike a balance between the two. Finally, we analyze the correctness and complexity of the algorithm theoretically under different data set distributions, verify its validity experimentally, and give a theoretical analysis of the experimental results.

Second, the k-anonymity data publishing model can prevent linking attacks but cannot prevent homogeneity attacks or background knowledge attacks. We therefore consider different degrees of protection for sensitive attribute values in three (α, k)-anonymity models and design three clustering algorithms to achieve them, as follows. We define a single-sensitive-value (α, k)-anonymity model to protect certain values of the sensitive attribute, and a multi-sensitive-value (α, k)-anonymity model to protect all values of the sensitive attribute.
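The difference between the single-value and multi-value models can be sketched as a checker. This assumes the commonly used reading of (α, k)-anonymity: every quasi-identifier group has at least k records, and the relative frequency of each protected sensitive value within a group is at most α. The function name, the `protected_values` parameter, and the sample data are hypothetical; the dissertation's exact definitions may differ.

```python
from collections import Counter, defaultdict

def is_alpha_k_anonymous(records, quasi_identifiers, sensitive, alpha, k,
                         protected_values=None):
    """Check (alpha, k)-anonymity under the assumed definition above.

    protected_values=None protects every sensitive value (multi-value model);
    otherwise only the listed values are protected (single-value model)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive])
    for values in groups.values():
        if len(values) < k:
            return False  # group too small: k-anonymity is violated
        counts = Counter(values)
        targets = counts if protected_values is None else protected_values
        for v in targets:
            if counts[v] / len(values) > alpha:
                return False  # value v is too frequent in this group
    return True

# Hypothetical published table with one quasi-identifier column "qi".
table = [
    {"qi": "A", "disease": "hiv"},
    {"qi": "A", "disease": "flu"},
    {"qi": "A", "disease": "cold"},
    {"qi": "B", "disease": "flu"},
    {"qi": "B", "disease": "flu"},
    {"qi": "B", "disease": "cold"},
]
single_ok = is_alpha_k_anonymous(table, ["qi"], "disease", alpha=0.5, k=3,
                                 protected_values={"hiv"})
multi_ok = is_alpha_k_anonymous(table, ["qi"], "disease", alpha=0.5, k=3)
```

Here the table passes when only `hiv` is protected, but fails the multi-value check because `flu` makes up two thirds of group B.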
We also define a semi-supervised clustering (α, k)-anonymity model that provides personalized protection for highly sensitive and less sensitive attribute values. Building on a thorough study of similarity (distance) metrics for data sets, and given that the data to be processed contain both numeric and discrete attributes, we give a detailed data mapping and processing method that makes distances within the data set easy to compute. This completely avoids the earlier practice of using information loss as the distance between data points. We also analyze the correctness and complexity of the algorithms. Finally, we measure the information loss and execution time of the algorithms experimentally and give a detailed theoretical analysis of the experimental results.

Third, we propose a clustering-based (k, l)-diversity data publishing model and design an algorithm to achieve it. Previous l-diversity privacy-preserving data publishing models ensure that each anonymized cluster contains at least l distinct sensitive values. However, this constraint alone does not guarantee that privacy is not disclosed. We therefore also constrain the number of records in each cluster. We measure the similarity of data objects with both discrete and numeric attributes using a joint probability distribution, in order to improve clustering quality and the utility of the anonymized data. We detail the strategies for cluster merging, restructuring, and generalization in the clustering-based anonymization process and, by combining the parameters k and l, put forward the concept of a privacy protection degree.
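As a rough illustration of the (k, l)-diversity constraint and of why cluster merging helps, the sketch below assumes that a cluster is acceptable when it holds at least k records and at least l distinct sensitive values. That reading, the function name, and the sample clusters are assumptions made for illustration; the dissertation's model may impose further conditions.

```python
def satisfies_k_l_diversity(clusters, sensitive, k, l):
    """True if every cluster has >= k records and >= l distinct sensitive values."""
    return all(
        len(cluster) >= k and len({r[sensitive] for r in cluster}) >= l
        for cluster in clusters
    )

# Hypothetical clusters produced by an earlier clustering step.
clusters = [
    [{"disease": "flu"}, {"disease": "cancer"}, {"disease": "flu"}],
    [{"disease": "flu"}, {"disease": "flu"}, {"disease": "flu"}],
]
before_merge = satisfies_k_l_diversity(clusters, "disease", k=3, l=2)

# Merging the homogeneous cluster into its neighbour restores diversity,
# at the cost of a larger (more generalized) cluster.
merged = [clusters[0] + clusters[1]]
after_merge = satisfies_k_l_diversity(merged, "disease", k=3, l=2)
```

The second cluster is large enough but contains a single sensitive value, so the check fails until the clusters are merged.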
We show that optimizing the clustering-based (k, l)-diversity algorithm is an NP-hard problem and analyze the complexity of the algorithm theoretically. Theoretical analysis and experimental results show that this method effectively reduces execution time and information loss and improves query accuracy.

Finally, we propose a privacy protection framework for data streams based on weak clustering and give a k-anonymity algorithm for data streams. This addresses the problem that existing data publishing techniques mainly protect static data sets and do not account for the increasingly common release of data streams. The algorithm is divided into an online part and an offline part. The online part rapidly performs weak clustering of the data stream using feature vectors and a clustering feature tree, dividing the stream into initial clusters. The offline part anonymizes and outputs the clusters that meet the predefined information loss and delay constraints, removes the output clusters from the clustering feature tree, and dynamically maintains the tree as it changes. In this way the data stream achieves k-anonymity and therefore privacy protection. Finally, our experiments verify the effectiveness of the algorithm.
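The online/offline split can be sketched in a few lines of Python. This is a heavily simplified illustration and not the dissertation's algorithm: it uses a plain list of pending clusters instead of a clustering feature tree, handles a single numeric attribute, ignores the information loss constraint, and simply suppresses undersized clusters whose oldest tuple exceeds the delay bound. The names `anonymize_stream`, `max_delay`, and `radius` are hypothetical.

```python
def anonymize_stream(stream, k, max_delay, radius):
    """Yield (low, high, size) ranges for clusters of at least k tuples.

    stream: iterable of (timestamp, value) pairs in timestamp order."""
    clusters = []  # pending clusters, each a list of (timestamp, value)
    for now, value in stream:
        # Online part: put the tuple in the nearest pending cluster within
        # `radius` of its centroid, or open a new cluster.
        nearest, nearest_dist = None, None
        for cluster in clusters:
            centroid = sum(v for _, v in cluster) / len(cluster)
            dist = abs(value - centroid)
            if dist <= radius and (nearest is None or dist < nearest_dist):
                nearest, nearest_dist = cluster, dist
        if nearest is None:
            nearest = []
            clusters.append(nearest)
        nearest.append((now, value))

        # Offline part: publish clusters that reached size k, suppress
        # undersized clusters whose oldest tuple has waited too long.
        remaining = []
        for cluster in clusters:
            if len(cluster) >= k:
                values = [v for _, v in cluster]
                yield (min(values), max(values), len(cluster))
            elif now - cluster[0][0] >= max_delay:
                continue  # simplification: drop instead of merging
            else:
                remaining.append(cluster)
        clusters = remaining

# Hypothetical stream of (timestamp, value) pairs.
stream = [(0, 10), (1, 11), (2, 50), (3, 12), (9, 51)]
released = list(anonymize_stream(stream, k=3, max_delay=5, radius=5))
```

With these inputs the three values near 10 are released together as one generalized range, while the two values near 50 never reach size k and are suppressed once the delay bound is exceeded. A fuller implementation would merge such clusters with neighbours rather than dropping them.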
Keywords/Search Tags: Data publishing, Privacy preserving, l-Diversity, (α, k)-Anonymity, Clustering, Generalization