Research On Privacy Preserving Data Mining

Posted on: 2009-08-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Yang
Full Text: PDF
GTID: 1118360275954681
Subject: Computer application technology
Abstract/Summary:
Recent developments in networking and storage technologies make it increasingly convenient to collect, process, and publish large volumes of data, which often contain a great deal of personal information, business secrets, and classified information. Once the data is obtained, and especially during the mining process, most of it can be used without any restriction. As a result, disclosure of the sensitive part can seriously invade personal privacy, disturb normal life, or even threaten the security of society. Data mining, one of the most powerful technologies for knowledge discovery, reveals hidden information and patterns in ordinary data. Although it brings knowledge and profit, the way it handles data raises severe problems. Concerns over data privacy have grown sharply because anyone with access to the mining process can obtain the original data records, which creates a high risk of data misuse.

Therefore, in recent years, a number of techniques have been proposed to solve these problems. Our research aims to provide a privacy-preserving approach to data mining by transforming the original data sets before the mining process. We have also developed several novel transformation techniques that still yield accurate mining results while privacy is well protected. Our main contributions are as follows:

1. We proposed a definition of the essence of data privacy and two strategies for protecting it. We analyzed most of the current privacy-preserving methods, discussing in detail the structure of their privacy objects. We found that few of their definitions accurately describe the essence of data privacy, which makes it difficult for the corresponding methods to provide comprehensive protection. Based on this understanding, we redefined data privacy in terms of data associations, which are much closer to the everyday concept of privacy. We also proposed two kinds of strategies to protect privacy under this new definition. In addition, at the beginning of the thesis, we introduce in detail the background of privacy protection and its fields of application.

2. We proposed a novel method of randomized anonymization to decompose data privacy, together with a mechanism to trade off accuracy against privacy so that threats from a priori knowledge are eliminated. For the data publishing scenario, we proposed a data randomization method that applies our first strategy: it randomly replaces the data in each record using the distribution of the original data. Compared with the well-known k-anonymization techniques, our method not only offers a much higher level of privacy protection but also preserves the useful knowledge in the original data set. Furthermore, a user may use a priori knowledge to infer sensitive information that the user is not allowed to know. We therefore developed a method to counteract threats from this kind of knowledge in data publishing. While the method introduces more uncertainty into the inference of original values, it also provides a mechanism to balance privacy and accuracy.

3. We proposed protocols for data transmission and data integration to transform data privacy, so that threats from malicious adversaries are counteracted, and we implemented customized privacy. Applying the second strategy, we presented an efficient clustering method for distributed multi-party data sets using orthogonal transformation and perturbation techniques. The miner, although receiving only the perturbed data, can still obtain accurate clustering results. This method protects data privacy not only in the semi-honest setting but also in the presence of collusion. Moreover, each attribute in a data set usually carries its own level of privacy concern, so it is necessary to give the data owner a mechanism to customize the perturbation of his or her own data. We implemented customized privacy so that each variable in the data set can be perturbed according to its importance, as specified by the owner.

4. We proposed an extensible privacy-preserving method that adapts to different numbers of participants, as well as a method to generate an independent perturbation. One of the main technical challenges for privacy-preserving data mining is to make its algorithms adaptable to the number of participants while still keeping the privacy and accuracy guarantees. We analyzed how accuracy and privacy protection are affected as the number of participants increases in the conventional method, and we proposed an improved method that handles a large number of participants. Moreover, we proved the importance of independent perturbation and proposed a method that adapts to high-dimensional data.
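
As a rough illustration of why a miner can still cluster accurately on transformed data (contribution 3), the sketch below applies a random orthogonal matrix to a few records and checks that pairwise Euclidean distances are unchanged. It is not the protocol from the thesis: the function names, the Gram-Schmidt construction, and the sample records are assumptions made for this example, and the additional perturbation, the multi-party protocol, and the per-attribute customized privacy are not modeled.

```python
import math
import random


def random_orthogonal_matrix(n, rng):
    """Hypothetical helper: n x n orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip (extremely unlikely) degenerate draws
            rows.append([a / norm for a in v])
    return rows


def transform(records, q):
    """Map each record x to Q x, where q holds the rows of Q."""
    return [[sum(qij * xj for qij, xj in zip(row, x)) for row in q]
            for x in records]


def distance(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)  # fixed seed so the example is repeatable
records = [[1.0, 2.0, 3.0], [4.0, 0.0, -1.0], [2.5, 2.5, 0.0]]  # made-up data
q = random_orthogonal_matrix(3, rng)
perturbed = transform(records, q)

# Distance-based clustering sees the same geometry before and after.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        before = distance(records[i], records[j])
        after = distance(perturbed[i], perturbed[j])
        print(f"pair ({i}, {j}): before={before:.6f} after={after:.6f}")
```

Because an orthogonal matrix preserves inner products, any clustering algorithm that depends only on Euclidean distances gives the same grouping on the transformed records, while the individual attribute values no longer match the originals.
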
Keywords/Search Tags: Data mining, Privacy preserving, Randomization, Anonymization, Knowledge preserving, Priori knowledge, Orthogonal transformation, Customized privacy, Scalability