
Research On Privacy Preserving Methods For Data Publication

Posted on: 2016-10-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Zhang
GTID: 1318330542974112
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of data analysis and internet technology, more and more organizations and scientific institutions have published large amounts of data to the public in order to support data sharing, statistics, and data mining. However, most published data contains sensitive information about individuals, such as diseases, salaries, and interests, so the data publishing process often carries the risk of leaking user privacy. Such privacy leakage also greatly hinders information publication and data sharing across society and undermines its harmonious and stable development. Therefore, data must be protected before it is published, to reduce the probability that a malicious attacker obtains users' sensitive information. This dissertation studies privacy preserving methods with respect to both privacy protection and data usability. On the premise of protecting data security, and according to the application requirements of the published data, data anonymization and data perturbation techniques are used to protect the original data and to balance the usability and the privacy of the published data. The main contents can be divided into the following four parts.

Firstly, traditional data anonymity models often construct equivalence classes whose sensitive values are highly correlated, and therefore cannot resist correlativity attacks. To address this problem, a new data anonymization model named (s,l)-diversity is proposed, which limits the correlativity of the sensitive values within each equivalence class. The model is based on the traditional l-diversity model and measures the correlativity of sensitive attribute values by the distribution of the quasi-identifier attributes in the sensitive sets, in order to reduce the information loss caused by equivalence classes with highly correlated sensitive values. In addition, an (s,l)-diversity clustering algorithm named SLCA is proposed to achieve (s,l)-diversity. SLCA measures the distance between tuples by the correlativity of their attribute values and can greatly reduce the information loss during data generalization. Experimental results and theoretical analysis demonstrate that SLCA reduces the correlativity of the sensitive values in the equivalence classes and better protects the privacy of the data sets, and that, compared with traditional privacy preserving algorithms, it is effective in terms of both information loss and execution time.

Secondly, existing algorithms that implement the t-closeness model often incur large information loss and long execution time, and cannot resist the sensitivity attack well. To address this problem, an (l,t)-closeness model is proposed based on a partition of the sensitive values into sensitivity levels. The (l,t)-closeness model relaxes the equivalence class constraint of the t-closeness model: it requires that the distance between the distribution of sensitivity levels in an equivalence class and that in the whole data table be no more than a threshold t. It uses the Hellinger distance to measure the distance between the two distributions, which avoids two problems of the Earth Mover's Distance (EMD): the need to set a standard distance and the high execution time. In addition, a clustering-based (l,t)-closeness anonymization algorithm named (l,t)-CCA is proposed. It partitions the sensitive attribute values into levels by self-information and realizes the anonymization model by extracting the nearest tuples from the sensitivity level buckets to construct equivalence classes. Experimental results show that, compared with existing algorithms for the t-closeness model and the (n,t)-closeness model, (l,t)-CCA has both smaller information loss and lower execution time, and achieves a better balance between the utility and the privacy of the published data.

Thirdly, traditional data anonymity methods are suited to data tables with only one sensitive attribute and cannot be applied directly to data tables with multiple sensitive attributes. First, because the traditional l-diversity model considers only the difference in form between sensitive values and not the difference in their sensitivity, a new attack pattern named the sensitivity attack is identified. Second, a new method for constructing sensitive groups based on sensitive attribute decomposition is proposed, in order to avoid the high information loss caused by generalizing the quasi-identifier (QI) attributes. In addition, inverse document frequency (IDF), a keyword weighting method widely used in information retrieval, is used to measure the sensitivity of the sensitive values, and a multi-sensitive-attribute (l1,l2,…,ld)-diversity privacy preserving method against the sensitivity attack, called MICD, is proposed. MICD guarantees the sensitivity difference between the sensitive values in each sensitive group by sensitivity inverse clustering, which improves the ability of the data table to resist the sensitivity attack. Experimental results and theoretical analysis demonstrate that, although the execution time of MICD is slightly higher than that of the Decomposition algorithm, MICD protects sensitive attributes against the sensitivity attack better than traditional privacy preserving algorithms and incurs less information loss.

Finally, data published after being perturbed by existing data perturbation methods can hardly preserve the clustering results of the original data. To solve this problem, a privacy preserving data perturbation method based on neighborhood topological potential entropy is proposed. It treats a d-dimensional data set as a d-dimensional space and partitions the nodes into two types according to their neighborhood topological potential entropy. A corresponding data perturbation algorithm is also proposed: for a neighborhood-dispersed node, the initial value is replaced by the average value of the node's k nearest neighbors; for a neighborhood-concentrated node, the initial value is replaced by the value of a node chosen at random from the node's safety neighborhood. Experiments show that this algorithm, DPTPE, not only avoids leaking data privacy but also better preserves the clustering utility of the data set.
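
As an illustration of the distance test described for the (l,t)-closeness model, the following minimal Python sketch compares the distribution of sensitivity levels in one equivalence class with that of the whole table using the Hellinger distance. It covers only the threshold t part of the model. The function names, the level labels ("low", "high"), and the sample data are assumptions made for this example and are not taken from the dissertation.

```python
import math
from collections import Counter


def level_distribution(levels, all_levels):
    """Relative frequency of each sensitivity level, in the order of all_levels."""
    if not levels:
        raise ValueError("levels must not be empty")
    counts = Counter(levels)
    total = len(levels)
    return [counts[lv] / total for lv in all_levels]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def satisfies_closeness(class_levels, table_levels, t):
    """Check that the class distribution is within distance t of the table distribution.

    class_levels and table_levels are lists of hypothetical sensitivity level
    labels, one per tuple; t is the closeness threshold.
    """
    all_levels = sorted(set(table_levels))
    p = level_distribution(class_levels, all_levels)
    q = level_distribution(table_levels, all_levels)
    return hellinger(p, q) <= t


if __name__ == "__main__":
    # Hypothetical table: half of the tuples are "low", half are "high".
    table = ["low"] * 4 + ["high"] * 4
    print(satisfies_closeness(["low", "high"], table, t=0.1))
    print(satisfies_closeness(["high", "high"], table, t=0.1))
```

Unlike EMD, the Hellinger distance here needs no distance defined between individual levels, and it is computed in a single pass over the two frequency vectors.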
Keywords/Search Tags: Data publishing, Privacy preserving, Data anonymization, Data perturbation, Information loss