Font Size: a A A

Research On K-means Clustering Algorithm Based On Differential Privacy

Posted on:2022-07-15Degree:MasterType:Thesis
Country:ChinaCandidate:J LiFull Text:PDF
GTID:2518306737453504Subject:Mathematics
Abstract/Summary:PDF Full Text Request
With the development of big data-related industries,data analysis is widely used in fields such as digital healthcare,location services,and social networks.The k-means algorithm is a commonly used data analysis method,and the published cluster centroid points and the number of samples in the cluster can easily lead to the privacy of sample data.Differential privacy,as a new privacy model based on data disturbance,can provide provable privacy protection in the worst attack scenario.Therefore,research on the differential privacy k-means clustering algorithm is an important work.Firstly,to solve the problem of selecting the initial centroids of DPLloyd algorithm,this paper proposes a method of selecting the initial centroids based on grid partition.This method uses the spatial distribution information of the dataset,and avoids the subjectivity and randomness of selecting the initial centroids.By using the S1 dataset for verification,when the total privacy budget is large,the initial centroids selection method based on grid division significantly improves the problem of non-convergence of the DPLloyd algorithm.Secondly,for the privacy budget allocation problem of the DPLloyd algorithm,this paper presents a privacy budget allocation scheme in the form of Taylor series.Under the condition of a certain total privacy budget,the privacy budget of a single round increases with the increase of the number of iterations,thus improving the clustering quality of the later round.By using the S1 data set for verification,when the total privacy budget is small,compared with the average privacy budget allocation scheme,the privacy budget allocation scheme in the form of Taylor series can greatly reduce the value of the evaluation index NICV.Finally,this paper designs the DPCenk-means algorithm based on centroid plus noise and the DPSumk-means algorithm based on sum function plus noise,and theoretically proves that both algorithms meet the definition of differential privacy.Through the use of Iris,S1,Adult,and 3D Road Network datasets for verification,under the same conditions that the initial centroids selection scheme and the privacy budget allocation scheme are the same,the clustering effect of the DPCenk-means and DPSumk-means algorithms is better than the DPLloyd algorithm.
Keywords/Search Tags:data mining, clustering, k-means clustering, privacy protection, differential privacy
PDF Full Text Request
Related items