Research On K-means Clustering Algorithm Based On Differential Privacy Protection

Posted on: 2019-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: N Liu
Full Text: PDF
GTID: 2518306512956289
Subject: Computer software and theory
Abstract/Summary:
Data mining is an important technique for extracting useful information through machine learning, pattern recognition, and mathematical statistics. It is widely used in fields such as e-commerce, medical care, and market analysis. As data mining applications continue to deepen, the risk of privacy leakage has become a sensitive and prominent issue. The K-means clustering algorithm is one of the most widely used algorithms in data mining. Differential privacy protection theory has become an important branch of privacy-preserving data mining because of its rigorous mathematical model and because it places no constraints on an attacker's background knowledge. Differential privacy protection is a privacy protection method based on data perturbation. K-means clustering based on differential privacy protection can effectively reduce privacy disclosure, but the perturbation easily distorts the data, which lowers the availability of data sets that satisfy differential privacy protection. How to achieve better data availability under a stronger privacy guarantee has therefore become the focus and the difficulty of research in this area.

This dissertation focuses on the availability and the efficiency of the differentially private K-means clustering algorithm. Based on an in-depth analysis of the reasons for the low availability of differentially private K-means clustering, an improved K-means clustering algorithm based on differential privacy protection is proposed. It improves the availability of the clustering results while ensuring data privacy, and it is further optimized to improve running efficiency.

1) The randomness of the Laplace noise leads to a large deviation of the cluster center points, especially when the privacy budget parameter ε is small, and the availability of the clustering results is then poor. To address this, a differentially private K-means clustering algorithm based on the silhouette coefficient, SCDP K-means (silhouette coefficient based differential private K-means), is proposed. The algorithm uses the silhouette coefficient to quantitatively evaluate the clustering quality of each iteration and adds different noise to different clusters. Because computing the silhouette coefficient is relatively expensive, a center-point-based calculation of the silhouette coefficient is used so that the running time of the algorithm grows steadily as the amount of data increases.

2) Privacy-preserving clustering algorithms run inefficiently on large-scale data because of their heavy memory consumption. To address this, a parallel version of the SCDP K-means clustering algorithm is designed and implemented.

The experimental results show that the parallel SCDP K-means clustering algorithm achieves good data availability while maintaining a high level of privacy protection, and that it still runs efficiently on large data sets. The long computation time introduced by the silhouette coefficient is effectively resolved.
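
The abstract does not give the algorithm itself, so the following is only a minimal Python sketch of the two standard building blocks it names: a Laplace-noised centroid update and a center-point-based (simplified) silhouette coefficient. The function names, the assumption that every coordinate is scaled to [0, 1], and the use of a per-iteration budget `epsilon` are illustrative assumptions, not details from the thesis. How SCDP K-means turns the silhouette value into different noise for different clusters is not stated in the abstract and is not attempted here.

```python
import math
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_centroid(points, dim, epsilon, rng):
    """Laplace-noised centroid of one cluster (hypothetical helper).

    Assumption: every coordinate lies in [0, 1], so adding or removing
    one point changes the (coordinate sums, count) vector by at most
    dim + 1 in L1 norm. epsilon is the budget spent on this iteration.
    """
    scale = (dim + 1) / epsilon
    noisy_count = len(points) + laplace(scale, rng)
    noisy_sums = [
        sum(p[j] for p in points) + laplace(scale, rng) for j in range(dim)
    ]
    # Guard against a tiny or negative noisy count before dividing.
    noisy_count = max(noisy_count, 1.0)
    return [s / noisy_count for s in noisy_sums]


def simplified_silhouette(points, labels, centroids):
    """Center-point-based silhouette, averaged over all points.

    a = distance to the point's own centroid,
    b = distance to the nearest other centroid,
    s = (b - a) / max(a, b).
    Cost is O(n * k) instead of O(n^2) for the pairwise silhouette.
    Requires at least two centroids.
    """
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])
        b = min(math.dist(p, c) for k, c in enumerate(centroids) if k != lab)
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(points)
```

With a small `epsilon` the noise scale `(dim + 1) / epsilon` grows, which is the center-point deviation the abstract describes. The simplified silhouette needs only point-to-centroid distances, which is why its cost grows linearly with the number of points.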
Keywords/Search Tags: data mining, differential privacy, K-means clustering, silhouette coefficient, MapReduce