Differentially Private Data Mining Over Affinity Propagation Cluster

Posted on: 2022-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: H B Cai
Full Text: PDF
GTID: 2518306485986069
Subject: Software engineering
Abstract/Summary:
Affinity Propagation (AP) clustering is a relatively new clustering algorithm. It is widely used in computer vision and computational biology, where it can help researchers classify animals and plants accurately and better understand population structure. However, when a dataset contains sensitive personal information (such as customer consumption records or income), that information can easily be leaked during clustering to an attacker with some background knowledge. Protecting sensitive information during clustering while preserving the utility of the clustering results is a challenging problem that needs to be solved. This thesis therefore proposes a Differential Privacy-Preserving Affinity Propagation Clustering (DP-AP) algorithm. The algorithm has three main parts: neighborhood density calculation, optimization of the preference value, and noise perturbation of the responsibility matrix. Experiments show that optimizing the preference value according to the neighborhood density improves running efficiency to some extent and reduces the accuracy loss of the responsibility matrix after perturbation.

In addition, because AP has high complexity, its performance on large-scale data still needs improvement. For DP-AP, a very large dataset degrades performance and introduces excessive noise, which in turn reduces the accuracy of the results. To address these problems, we combine differential privacy with distributed affinity propagation clustering based on MapReduce (DisAP). The resulting method is named Distributed Affinity Propagation Clustering Based on Differential Privacy (DP-DisAP). By dividing a large dataset into several disjoint small datasets and perturbing each one separately, information loss is effectively reduced and the accuracy of the published results is improved.

This thesis focuses on the privacy problems of the affinity propagation clustering algorithm. Through a detailed analysis of how privacy can leak while the algorithm runs, corresponding solutions are proposed. The main research work is as follows:

(1) A differential privacy-preserving affinity propagation clustering method, DP-AP, is proposed. We discuss the privacy problems in affinity propagation clustering for the first time and point out that privacy leakage in affinity propagation clustering is real and serious. A privacy protection scheme is proposed to address it: before perturbation, the preference value is pre-optimized according to the neighborhood density of each data point to pre-select potential exemplars, and Laplace noise is then added to the initial responsibility matrix. Under the same privacy budget ε, this effectively reduces information loss and the oscillation of the algorithm during perturbation, while accelerating convergence and improving accuracy. Experiments are conducted on three datasets (Iris, Seeds, Synthetic), and DP-AP is compared with four existing algorithms: DP-SNNDPC, DP-KNNDPC, DP-DPC, and DP-FKNNDPC. Accuracy and efficiency are evaluated with four measures: ARI, FMI, AMI, and running time. The experimental results show that DP-AP has advantages in both execution efficiency and data utility.

(2) A differential privacy-preserving affinity propagation clustering method for big data, DP-DisAP, is proposed. By combining DP-AP with the MapReduce-based distributed AP clustering algorithm (DisAP), the privacy protection problem under a parallel computing framework is addressed. Because AP under MapReduce divides a large dataset into several disjoint subsets of similar size and dispatches them to the nodes for parallel AP clustering, the parallel composition theorem of differential privacy can be applied when allocating the privacy budget. On this basis, applying DP-AP reduces the accumulation of noise on large-scale datasets and further improves the utility of the algorithm. Finally, a rigorous mathematical proof shows that DP-DisAP satisfies ε-differential privacy. For the parallel setting, we conducted experiments on the 3D-Cluster and Swisse datasets at different magnitudes (1000, 2000, 10000, and 100000). Because directly comparable methods are lacking, this thesis compares DP-DisAP directly with DisAP. The experimental results show that, while ensuring data privacy, DP-DisAP still maintains considerable accuracy and running efficiency relative to DisAP.
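
The two mechanisms named in the abstract, Laplace noise on the initial responsibility matrix and per-subset perturbation of disjoint partitions, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the median-based default preference (used here in place of the neighborhood-density optimization), and the `sensitivity` parameter are all assumptions made for illustration; the abstract does not state the sensitivity bound the thesis uses.

```python
import numpy as np


def similarity_matrix(X, preference=None):
    """Negative squared Euclidean similarities, preference on the diagonal."""
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=-1)
    if preference is None:
        # Assumed stand-in: median off-diagonal similarity. The thesis tunes
        # the preference from neighborhood density, not reproduced here.
        preference = np.median(S[~np.eye(len(X), dtype=bool)])
    np.fill_diagonal(S, preference)
    return S


def initial_responsibility(S):
    """First AP responsibility update with all availabilities set to zero:
    r(i, k) = s(i, k) - max over k' != k of s(i, k')."""
    rows = np.arange(S.shape[0])
    best = np.argmax(S, axis=1)
    first = S[rows, best]
    masked = S.copy()
    masked[rows, best] = -np.inf
    second = masked.max(axis=1)
    R = S - first[:, None]
    R[rows, best] = first - second
    return R


def laplace_perturb(R, epsilon, sensitivity, rng):
    """Add Laplace(sensitivity / epsilon) noise to every entry of R.
    The sensitivity value is a caller-supplied assumption."""
    return R + rng.laplace(0.0, sensitivity / epsilon, size=R.shape)


def perturb_partitions(X, n_parts, epsilon, sensitivity, rng):
    """Split X into disjoint subsets and perturb each one separately.
    Since the subsets are disjoint, parallel composition lets each subset
    use the full budget epsilon."""
    parts = np.array_split(rng.permutation(len(X)), n_parts)
    results = []
    for idx in parts:
        R0 = initial_responsibility(similarity_matrix(X[idx]))
        results.append((idx, laplace_perturb(R0, epsilon, sensitivity, rng)))
    return results
```

In a full pipeline, each noisy matrix would seed the usual AP message-passing iterations on its own node. The sketch stops at the perturbation step because that is the only part the abstract specifies.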
Keywords/Search Tags: Differential privacy, Affinity Propagation Cluster, privacy preservation, data mining