Font Size: a A A

Privacy-preserving Research Of The Distributed Clustering Algorithm Based On Density

Posted on:2010-04-12Degree:MasterType:Thesis
Country:ChinaCandidate:R J LiuFull Text:PDF
GTID:2178360272497476Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development and maturity of web technology and distributed database technology, distributed clustering design technology has received increasing attention. Many distributed clustering algorithms have been put forward, such as, RACHET,DBCA,DBDC and so on. These algorithms can resolve distributed and large scaled dataset clustering problem, they can save a great deal of time comparing with centralized clustering algorithms. However, data privacy and information secure problem hasn't been noticed in them, so they all reveal the privacy in the course of clustering. In this respect, distributed clustering algorithm based on privacy-preserving named PP-DBDC is put forward in this paper.Using privacy-preserving technology based on horizontal distribution, PP-DBDC algorithm has improved DBDC which is distributed clustering algorithm based on density. In order to preserve data privacy, the un-trusted third party as the main party is introduced in this algorithm. Furthermore, the obscuring data technique and the secure multi-party protocol for the distance between two points which is put forward in this paper are used in the course of clustering, so that the real data information can be protected and not transmitted to other parties. In this way, the data can be hidden well and the privacy can be protected.The key of the obscuring data technique is efficient integration of specific obscuring means and existing mining ways. The technique transforms the initial data to make sure they can't be estimated from the transformed data. Accordingly, the purpose of obscuring data and hiding privacy information can be achieved. In PP-DBDC algorithm, obscuring data as the noise of actual data is produced randomly in every slave site, and transmitted to other sites, so as to protect the actual data will not be found.Secure multi-party computation is a kind of distributed computation protocol in order to complete some computation task. Based on the idea of security multi-party sum of security multi-party computation, this paper advances the secure multi-party protocol for the distance between two points as follows:For the convenience, assume that the two parties involved in the clustering computation (which can be extended to multi-party), both parties respectively have one point (x1, y1) and (x2, y2), (the case of multi points is the same as it), secure multi-party protocol for the distance between two points can be carried out as following steps.(1) Each of the two parties generates a random obscuring data d1 and d2.Make the temporary variable (x1', y1') and (x2', y2') equal to the coordinate value plus the obscuring data of its own party.(2) Transmit d1 to P2 and d2 to P1and then make the obscured data (x1'', y1'') and (x2'', y2'') plus the temporary variable of its own party and the obscured data of the other party.(3) Transmit (x1'', y1'') and (x2'', y2'') to the un-trusted third party.(4) The obscured distance of the two points computed by the un-trusted third party get the value same as the real distance of the two points so as to get the exact computing result. Assume that users of m parties are involved in distributed clustering computation and parties do not hope to expose their privacy in the course of clustering, but they want to get the exact clustering result. In the PP-DBDC algorithm, firstly different parties generate random obscuring data, and broadcast it to other m-1 slave parties. After the different slave parties receive the obscuring data of other parties, they will compute the obscured data according to their real data and then send them to the main party i.e. the un-trusted third party. The main party will gain the exact distance between the two points of the received obscured data with the secure multi-Party protocol for the distance between two points, thereby the global clustering will be achieved with the DBSCAN algorithm and then the obscured data and clustering identity will be sent to corresponding parties. It is thus clear that PP-DBDC has no effect on the clustering effect.In PP-DBDC algorithm, multi-party users involving computation generate the obscuring data randomly, then send them to other slave sites and send obscured data to the un-trusted third party UTP. After UTP receives data, it does not know the value of the obscuring data in every party, so the original data value cannot be deduced. Moreover, the different slave parties only received the obscuring data of other parties and the value of their original data cannot be seen, the obscured data are not received by them, either. So, the original data of other parties cannot be deduced. So the algorithm meets the needs of the privacy-preserving restriction.In order to meet the intention of the privacy-preserving, the PP-DBDC algorithm needs to transmit obscuring data. As to the circumstance of the clustering involvement of m parties, the obscuring data will be transmitted to other (m-1) parties. Compared with the original DBDC algorithm, the time spending increased. However, in the data mining based on privacy-preserving, the three indexes of security, accuracy and efficiency cannot get the highest at the same time. Any two indexes are in a state of mutual restriction, so the improvement of one performance will result in the decline of the other. In the algorithm of PP-DBDC, the security and accuracy of the algorithm are guaranteed and some efficiency of time is sacrificed.In this paper, there is an experiment using synthetic data sets in the local area network to PP-DBDC algorithm. The result show that the algorithm can get the right clustering result to distributed dataset. In the experiment of efficiency, we find that the execution time of PP-DBDC is shorter than that of DBSCAN, and the saving time is more with the more sites. But comparing with DBDC, the execution time of PP-DBDC is longer, and it's longer with the more sites. The reason is that PP-DBDC need transmit obscuring data among the slave sites in the course of clustering, it uses some time comparing with DBDC. However, PP-DBDC only need transmit one obscuring data, but DBSCAN need transmit all data to the main site, so it will use more time. Theoretical analysis and experimental results show that the PP-DBDC algorithm is efficient.The PP-DBDC algorithm can efficiently solve privacy-preserving problem in distributed homogeneous dataset. It protects the privacy data information, so that the enterprises and individuals can get final clustering result not sharing own actual data. Distributed privacy-preserving technology has the wide range of applications in real life. Such as many institutions need cooperate to analysis the privacy data of criminal record, diseases record, card record and so on, they can use PP-DBDC algorithm to protect the privacy of enterprises and individuals and get the true analysis results.The distributed clustering algorithm based on privacy-preserving in this paper is for the horizontal partition dataset, it also fits the vertical distribution dataset. Our next work is to research and realize the distributed privacy-preserving clustering algorithm for the vertical partition dataset.
Keywords/Search Tags:Distributed data mining, Clustering, Distributed clustering, DBDC, Privacy-preserving
PDF Full Text Request
Related items