Research Of Privacy-Preserving Clustering Algorithm Over Distributed Data

Posted on:2010-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:D F Yang

Full Text:PDF

GTID:2178360272491579

Subject:Computer applications

Abstract/Summary:

Privacy-Preserving data mining in distributed or grid environment is an important and hot research field in recent years. Data mining in distributed environment is different from centralized environment, and obviously a solution is needed to be found by Privacy-preserving data mining to such a contradiction that without disclosing their private data to each other, each user having a private data set and they just wish to collaboratively discover patterns on the union/combine of their private data sets. Therefore, we need to put forward a novel approach to discover Privacy-preserving patterns without share data, which can be called Privacy-Preserving Data Mining(PPDM).There are many approaches which have been adopted for PPDM. They can be classified into two major categories: Cryptography-based techniques and Randomization-based approaches. In this thesis, we focused on the former. Secure multi-party computation (SMC) is an active research in the field of information security and distributed computation in recent years and has very important research significance in the theory and application. Secure multi-party computation refers to the problem where two or more parties want to jointly compute a task based on their private inputs, while no party is willing to disclose his privacy to any other one.Privacy-preserving data mining solutions have been presented both with respect to horizontally and vertically partitioned databases, in which either different data objects with the same kind of attributes are owned by each party, or different kind of attributes for the same data objects are owned by each party, respectively. We introduce the notion of arbitrarily partitioned data, which generalizes both horizontally and vertically partitioned data. In arbitrarily partitioned data, different attributes for different items can be owned by either party. Although extremely "patch-worked" data is unlikely in practice, one advantage of considering arbitrarily partitioned data is that protocols in this model apply both to horizontally and vertically partitioned data, as well as to hybrid that are mostly, but not completely, vertically or horizontally partitioned.For arbitrarily partitioned data sets, we propose a Privacy-Preserving Distributed k-Means Clustering over Arbitrarily Partitioned Data. For large data sets which are confronted by data mining, we propose a Privacy-Preserving Distributed BIRCH Clustering over Arbitrarily Partitioned Data. The algorithms use Scalar Product Protocol, Index of the Minimum Value of Vector Sum Protocol, Secure Multiparty Multiplication Protocol, Secure Comparison Protocol and so on to achieve the privacy protection. Through experiments, the correctness of the algorithms based on this paper are verified by comparing the results with centralized data sets. At the same time, the comparison of running time cost demonstrates the feasibility of the algorithms in this paper. Finally, communication cost and computational complexity are analyzed in the thesis.

Keywords/Search Tags:

distributed data mining, privacy-preserving, k-means clustering, BIRCH, secure multi-party computation

Related items

1	Research On Privacy Preserving K-means Clustering Algorithm
2	Research On Privacy-Preserving Application Protocol Based On Secure Two-Party Computing
3	Research On Distributed Data Mining Based On Privacy Preserving
4	Research On Privacy Preserving K Nearest Neighbor Classification Algorithm
5	Research On Several Secure Multi-Party Computation Problems And Applications
6	Study On Privacy Preserving Classification Data Mining
7	Research On Privacy Preserving Distributed Clustering Algorithm
8	Analyzing And Researching Based On Privacy-Preserving Clustering
9	Research On Data Privacy Preservation Technologies Using Secure Multi-Party Computation
10	Research On Privacy Computing And Data Consistency In Distributed Scenario