Font Size: a A A

Privacy-Preserving K-means Clustering Algorithms Based On Negative Databases

Posted on:2020-12-05Degree:MasterType:Thesis
Country:ChinaCandidate:X Y HuFull Text:PDF
GTID:2428330623466988Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In the era of big data,privacy protection is a key issue.As a classic algorithm for data mining,the original k-means algorithm is often used for analyzing plaintext,and does not consider privacy protection issues.There are many works study privacypreserving k-means algorithms,but they all have disadvantages,for example,high computational cost(such as encryption algorithms),high precision loss(such as data perturbation methods)and high communication cost(such as secure multi-party computation).As a new type of data representation,negative database stores the information which is not contained in traditional databases.The hard-to-reverse negative database can be used to protect the privacy of the original data.Also,negative database supports distance estimation and can be applied to k-means clustering algorithms to protect privacy.The existing privacy-preserving k-means algorithms upon negative database only support Hamming distance,but in reality,the similarity calculation that is most commonly used is Euclidean distance,which greatly limits the scope of application of this method.According to the above situation,this thesis used the characteristics that the coded bits in the Euclidean distance binary number coding method only related to the size of the attribute value and proposed a Euclidean distance estimation formula upon negative databases.On this basis,a new privacy-preserving k-means clustering algorithm based on negative database is proposed,and the negative database generation algorithm is improved to further improve the clustering accuracy.The main work is as follows:1)Proposed a Euclidean distance estimation method upon negative database.According to the characteristics of negative database,after the probability calculation by Bayes' theorem,the Euclidean distance between a hidden string and a real-time string can be estimated when the negative database and real-time string are known.The accuracy of the proposed Euclidean distance estimation method is demonstrated by experiments.2)Proposed a privacy-preserving k-means clustering algorithm based on negative database.The main idea is to use the K-hidden algorithm to protect privacy by generating negative database for each string,and use the Euclidean distance estimation method to calculate distance during the process of clustering.Experiments demonstrated our proposed privacy-preserving k-means clustering algorithm can achieve almost the same clustering accuracy as the original k-means algorithm while protecting privacy.3)Proposed a negative database generation algorithm(named QK-hidden algorithm),and constructed a fine-grained privacy-preserving k-means algorithm to improve clustering accuracy.On the foundation of the K-hidden negative database generation algorithm,the QK-hidden algorithm introduces a set of parameters to control the inversion probability of each attribute bit,so that the estimation accuracy and privacy of each attribute bit can be controlled.According to the characteristics of QK-hidden algorithm,the Euclidean distance estimation method for QK-NDB is proposed.The method is applied to privacy-preserving k-means algorithm and experimental results demonstrated that QK-hidden algorithm can improve the clustering accuracy of the original privacy-preserving k-means algorithm.
Keywords/Search Tags:Negative Databases, Privacy-Preserving, K-means Clustering, Euclidean Distance
PDF Full Text Request
Related items