
Research On The Privacy Protection In Classification Mining

Posted on: 2012-10-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Li
GTID: 1118330362450151
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, people have collected large amounts of data, and how to make use of that data has become a problem. Data mining is a powerful tool for solving it.

As data mining applications spread, privacy protection has become an important problem for data mining: privacy must be protected whenever mining is performed. Ordinary data mining methods assume that the data can be accessed directly. This assumption is incompatible with the principles governing the use of private data. In practice, for privacy reasons, the data often cannot be accessed directly. It is therefore necessary to study how to carry out data mining when the original data cannot be accessed directly.

This dissertation studies privacy protection in classification mining. Many methods have recently been proposed for this problem, but research in the field is still insufficient in two respects. First, some problems cannot be solved by the existing privacy preserving classification methods. For example, there is no sufficiently good method for privacy preserving neural network learning on distributed databases, so it is difficult to use neural networks to mine private data in practice. Second, the current algorithms still have room for improvement. For example, the privacy preserving classification method based on singular value decomposition (SVD) has two main shortcomings. One is that the SVD-based method perturbs all samples and attributes to the same degree. Because different samples and attributes may need different degrees of privacy protection and may not be equally important for data mining, it is better to modify different samples and attributes to different degrees. The other is that the SVD-based method uses only SVD to analyze the data.
Because different data analysis methods analyze data from different perspectives, it would be better to use several data analysis methods together.

To overcome these weaknesses, this dissertation studies privacy protection in classification mining and proposes new algorithms for it. It makes the following contributions.

(1) No sufficiently good privacy preserving neural network learning method for distributed databases can be found in the existing literature, so it is difficult in practice to train neural networks under a privacy protection requirement. To solve this problem, this dissertation proposes a privacy preserving back propagation algorithm, based on secure multi-party computation, for training artificial neural networks on partitioned databases. The algorithm trains neural networks in a privacy preserving way by using an information exchange protocol based on secure multi-party computation to exchange, among the nodes of the partitioned database, the information needed by the back propagation algorithm.

(2) DNALA is a DNA sequence anonymization method. DNALA first aligns the sequences and calculates a distance matrix, and then clusters and generalizes the sequences based on that distance matrix. For alignment, DNALA uses multiple sequence alignment, which is time-consuming. It also uses a greedy clustering algorithm, whose precision is not very high. Finally, DNALA is not an online method, so it cannot produce results quickly when the data changes. To overcome these weaknesses, this dissertation improves the DNALA method. The multiple sequence alignment in DNALA is replaced by pairwise sequence alignment over all pairs of sequences to improve efficiency. The greedy clustering algorithm in DNALA is replaced by a hybrid clustering algorithm, which comprises an MWM-based algorithm and an online algorithm. The online algorithm has an efficiency advantage, especially when the database is updated.
However, the accuracy of its results is not high. The MWM-based algorithm can achieve better clustering results with the same time complexity as the greedy algorithm in DNALA. The hybrid algorithm is designed to combine the advantages of the two: the online algorithm is used when the database is updated to obtain results quickly, while the MWM-based algorithm is run periodically to improve the results.

(3) The SVD-based privacy preserving classification method perturbs all samples and attributes to the same degree. However, because different samples and attributes may need different degrees of privacy protection and may not be equally important for data mining, it is better to modify different samples and attributes to different degrees. To solve this problem, this dissertation proposes a new privacy preserving classification method based on SVD, sample selection and attribute selection. The method uses sample and attribute selection to find the important samples and attributes. It then perturbs the important samples and attributes strongly and the other samples and attributes weakly.

To solve the same problem, this dissertation also proposes a privacy preserving classification method based on weighted SVD. In this method, each sample has a weight that indicates its importance for data mining. The dissertation extends the SVD-based data perturbation method into a weighted SVD-based one and uses it to perturb the data.

The SVD-based privacy preserving classification method analyzes data using SVD alone. If multiple data analysis methods are used in combination, the data can be analyzed more comprehensively. Based on this idea, this dissertation proposes a new privacy preserving classification method based on SVD and independent component analysis (ICA).

(4) There are two kinds of privacy protection methods for classification mining: algorithm-related methods and algorithm-irrelevant methods.
Each algorithm-related method is designed for a particular classification algorithm, and other classification algorithms cannot be used with it. By comparison, multiple ordinary classification algorithms can be used with the algorithm-irrelevant methods. At present, the algorithm-irrelevant methods are all based on data perturbation. Randomization is the most widely used data perturbation method, but it has not been used in algorithm-irrelevant privacy protection methods. This dissertation proposes a new algorithm-irrelevant privacy protection method based on randomization. It independently generates a new data set that differs from the original data set and releases it as the perturbed data. The perturbed data and the original data have the same distribution, so users can obtain models of the original data from the perturbed data.

In conclusion, the main contribution of this dissertation is its research on the privacy protection problem in classification mining, in which it proposes new methods and improves some existing ones.
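
The randomization idea in (4), releasing a freshly generated data set instead of the original records, can be illustrated with a minimal sketch. It assumes, purely for illustration, that each class is modeled by independent Gaussian attributes; the abstract does not say how the dissertation generates the new data set, and the names `fit_class_gaussians` and `generate_release` as well as the toy values are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def fit_class_gaussians(records, labels):
    """Estimate the per-class mean and standard deviation of every attribute."""
    by_class = defaultdict(list)
    for row, label in zip(records, labels):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # transpose: one tuple per attribute
        model[label] = {
            "count": len(rows),
            "mean": [statistics.fmean(col) for col in columns],
            "std": [statistics.pstdev(col) for col in columns],
        }
    return model


def generate_release(model, seed=0):
    """Sample a new data set from the fitted model (assumed Gaussian per class).

    Only the fitted statistics are used here; no original row is copied.
    """
    rng = random.Random(seed)
    records, labels = [], []
    for label, stats in model.items():
        for _ in range(stats["count"]):
            row = [rng.gauss(m, s) for m, s in zip(stats["mean"], stats["std"])]
            records.append(row)
            labels.append(label)
    return records, labels


# Toy original data: two attributes, two classes (values are made up).
original = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.1, 0.5], [2.9, 0.4], [3.0, 0.6]]
classes = ["a", "a", "a", "b", "b", "b"]

released, released_classes = generate_release(fit_class_gaussians(original, classes))
```

An ordinary classifier could then be trained on `released` rather than on `original`. How closely such a model matches one trained on the original data depends on how well the assumed distribution fits, which this sketch does not address.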
Keywords/Search Tags: Privacy Protection, Data Mining, Classification, Data Perturbation, Singular Value Decomposition