Font Size: a A A

Privacy-Preserving Data Mining Method Based On Additive Noise And Euclidean Distance

Posted on:2012-03-09Degree:MasterType:Thesis
Country:ChinaCandidate:L K LiuFull Text:PDF
GTID:2178330332999975Subject:Network and information security
Abstract/Summary:PDF Full Text Request
Recent more and more researchers are interest in the collection and monitoring of data using data mining technology for the purpose of security and business-related applications has raised serious concerns about privacy issues. For example, mining health data for the detection of discipline of disease out breaks may require analyzing hospital, clinical records and pharmacy transaction data of many individuals over a certain area. However, releasing and gathering such diverse personal's information belonging to different parties may violate privacy-protected laws and eventually be a threat to civil liberties. Privacy-preserving data mining(PPDM) strives to provide a solution to this dilemma. It intends to allow useful data patterns to be discovered without compromising privacy.With increasing concerns about privacy,researchers have developed various private-preserving data mining(PPDM) techniques. The previous works can be divided into two categories, data modification and data encryption. In 2000,Agrawal and Srikant proposed the addition of i.i.d. noise for privacy. However, Kargupta etal. pointed out that the additive noise can be easily filtered off and revealing a good approximation of the private data. This makes a researcher wonder about the possibility of using multiplicative noise. However, Kun Liu studied that how well an attacker can recover the original data from the transformed and prior information. He proposed four different attack techniques based on prior information,Known input-output attack, Known sample attack and Independent signals attack. At the same time,data encryption technology is also widely used in this field, but compared with data interference, it costs more time of computing and communicating. The application cost is also higher, but a small chance of being broken, it is more security. It is mainly applicated in distributed data mining.In the research of data interference, the main methods are Data Sanitation, Data Swapping, Data Generalization, Data Permutation, Data Blocking, Data Randomization. Some knowledge can be used, which are Linear Space, Statistical Theory, Probability Theory and Signal Process Methods. The most common way of data interference is additive noise method. With this method, using reconstruction technique to mine data, which rebuilds the distribution of perturbed data to keep same distribution as original data. But the time complexity of this technology is high, so Li Liu proposed a threshold method, which skipped the process of reconstruction. In recent years, in order to ensure the safety of the geometric data mining, linear transformation boom set off for a while, from the support vector machine, to a variety of distance-based matrix transformation. For the aspect of data encryption, the main methods are Homomorphic Cryptography,Oblivious Transfer and Secure Multiparty Computation, which use Public Key Cryptography Mechanism.After analyzing privacy-preserving data mining algorithms, we concluded the shortcoming of previous work, that is no matter noise addition perturbation or matrix multiplicative perturbation, both of them have the potential possibility of being attacked. In this paper, we use a two-step interference model, which combines addition and multiplication. First step is additive perturbation, and second is that using distance-based matrix transformation to perturb the obfuscated data,increasing the resistance to attacks. Based on similarity theory, we propose two novel additive perturbation algorithms:RDD and RACC. With the two methods, we can get higher mining accuracy.In the previous additive noise research, the common method is that first adding noise directly to original data, and then processing the obfuscated data, so that we can keep the distribute as original data or we can get higher mining accuracy to close the original data mining. In this paper, we chance the view to think of the noise addition process. We first mine original data and get which category the original data belongs to, then adding noise data. In order to keep data in the same category before and after perturbation, RDD and RACC are used to process the perturbed data. After the original classification data mining, we add uniform distribution noise and Gaussion noise. In this paper, we mainly process the records which exceed the domain that they belong to. RDD algorithm compute the distance "d" between the location of the new record and the border of category. Selecting a record as new record which is distance between the center and border of category minus "d". RACC algorithm uses the circular or spherical whose center is the center of category, radius is the distance between center and original record. In the circular or spherical, randomly choosing a point as the new record.Our work solves the following questions:1. RDD and RACC algorithms skip the re-construct process. Miner can directly mine the perturbed data, and get very high similiraty with original. Improving operational efficiency. At the same time, using this two algorithms, the distribution is not same before and after noise addition, so resist the attacks, rely on the distribution of noise, which filter noise.2. When the two-step model, which is combination of addition and multiplication applicate on geometric data mining, the additon uses RDD and RACC, while matrix multiplication uses orthogonal matrix. It protects the Euclidean distance, thus ensures the mine accuracy. This model also resists attacks based on prior knowledge.Finally, we use the open source data mining software WEKA and select three representative real data sets coming from the University of California. We choose different type and noise level to produce noise data to make experiments using RDD and RACC. We compare our results with Keke Chen's algorithm, the experimental results show that the accuracy of our algorithms is higher than Keke Chen's. When we use Gaussion noise, with the increase of noise level, RDD method get significantly larger range of data perturbations than RACC, because the num of exceeding the domain of data is more, so the privacy can be protected better. On the contrary, RACC has more disturbances space. RACC algorithm is more stable than the RDD.
Keywords/Search Tags:Privacy-preserving, Data Perturbation, Noise addition, Euclidean distance, Data mining
PDF Full Text Request
Related items