
Research On Several Problems Related To Privacy-preserving Microdata Publishing

Posted on: 2022-06-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Wang
GTID: 1488306737492934
Subject: Computer Science and Technology
Abstract/Summary:
Data sharing lets people concentrate their effort and computing resources on specific data applications by reducing repetitive work such as data collection. Data publishing is an important method of open data sharing. Because raw data may contain a large amount of information closely related to individual privacy, data publishers should not release raw data directly to the public; otherwise, the privacy of the individuals from whom the data were obtained may be violated, and harms such as economic loss or emotional injury may follow. More seriously, privacy leakage may threaten public safety. To alleviate the conflict between data openness and data privacy, privacy-preserving data publishing has been studied extensively. It aims to protect the private information in raw data through techniques such as data anonymization, data perturbation, and data encryption, while preserving the utility of the published data. In keeping with open data trends, many data publishing methods have been put into practice. However, when government departments or business organizations collect massive amounts of data with more complex structures, the data to be published take on new characteristics, such as having multiple sensitive attributes, being high-dimensional, or being distributed among different parties. These characteristics make most existing data publishing methods less effective. To deal with these emerging characteristics, this dissertation takes microdata as its research object and focuses on privacy-preserving microdata publishing for data mining tasks.

First, microdata containing multiple sensitive attributes is studied. This work reviews why the t-closeness model can resist the skewness attack and then extends the t-closeness model, which was defined for a single sensitive attribute. Based on these analyses, a novel data publishing strategy and two algorithms for releasing microdata with multiple sensitive attributes are proposed, enabling the published data to satisfy predefined k-anonymity and t-closeness requirements. Because the sensitive values in any equivalence class must be as dispersed over the entire data as possible for the published data to satisfy t-closeness, the two algorithms use different methods to partition records into groups with regard to the sensitive attributes: one employs a clustering-based method, whereas the other leverages principal component analysis. Records are then selected from different groups, according to the similarity of their quasi-identifier attributes, to construct equivalence classes. This operation enforces the protection of data privacy while carrying out data anonymization.

Second, high-dimensional microdata is studied. This study analyzes the difficulties that high dimensionality poses for privacy-preserving data publishing and defines the problem of releasing high-dimensional data for classification analysis. The challenge is how to reduce the dimensionality under a privacy model while preserving as much information as possible for classification. The proposed solution is inspired by vertical partitioning: the raw data are divided vertically into disjoint subsets of smaller dimensionality, and a generalization method based on local recoding is then applied to each subset separately to attain k-anonymity. Because achieving optimal k-anonymity is computationally hard, the local recoding method seeks a near-optimal solution to improve efficiency. The proposed method can reduce information loss during data anonymization and improve the utility of the published data.

Third, microdata containing both single-valued and set-valued attributes is studied. This study defines the problem of releasing such data for cluster analysis. Here, the challenge is how to keep the cluster structure of the published data similar to that of the raw data. The proposed approach converts the clustering problem into a classification problem, in which class labels encode the cluster structure of the raw data and guide the masking process. The approach probabilistically generalizes the raw data and adds noise to the generalized data. It preserves the utility of the published data for cluster analysis while ensuring that the entire process satisfies ε-differential privacy, thus enhancing data privacy.

Fourth, microdata distributed among different parties is studied. This study describes the risk of privacy leakage during data integration and data publication and analyzes the relevant secure multi-party computation protocols. A differentially private solution is then proposed to anonymize arbitrarily partitioned data held by two parties in the semi-honest model. Two privacy requirements are met. First, one party cannot learn anything about the other party's data beyond the final result and the information that can be inferred from it. Second, the collaborative anonymization satisfies ε-differential privacy in order to protect the privacy of the individuals from whom the integrated data are obtained. To meet these requirements, the proposed distributed anonymization algorithm must guarantee that each of its steps satisfies the definition of ε-differential privacy while being carried out as a secure two-party computation.

The first three parts focus on centralized microdata publishing, whereas the last deals with distributed microdata publishing. As a result, this study may enrich the scenarios of privacy-preserving microdata publishing, strengthen the privacy protection of raw data, and maintain the utility of the published data.
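
To make the k-anonymity and t-closeness requirements of the first contribution concrete, the following Python sketch checks both properties on an already generalized table with several sensitive attributes. It is an illustration only, not the dissertation's algorithms: the function names, the toy records, the use of total variation distance between distributions (the abstract does not state which distance is used), and the choice to check each sensitive attribute independently are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


def distribution(records, attribute):
    """Relative frequency of each value of one attribute."""
    counts = Counter(rec[attribute] for rec in records)
    total = len(records)
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Total variation distance between two discrete distributions.

    Assumption: used here as a stand-in for the distance measure,
    which the abstract does not specify.
    """
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


def max_closeness(records, quasi_identifiers, sensitive_attributes):
    """Largest distance between any class-level and table-level
    distribution, taken over all sensitive attributes (assumption:
    each attribute is checked independently). The table satisfies
    t-closeness under this reading when the result is <= t.
    """
    classes = equivalence_classes(records, quasi_identifiers)
    worst = 0.0
    for attr in sensitive_attributes:
        overall = distribution(records, attr)
        for group in classes.values():
            local = distribution(group, attr)
            worst = max(worst, variational_distance(local, overall))
    return worst


# Hypothetical generalized table: two quasi-identifiers, two sensitive attributes.
records = [
    {"age": "20-29", "zip": "130**", "disease": "flu", "salary": "low"},
    {"age": "20-29", "zip": "130**", "disease": "cancer", "salary": "high"},
    {"age": "30-39", "zip": "148**", "disease": "flu", "salary": "high"},
    {"age": "30-39", "zip": "148**", "disease": "cancer", "salary": "low"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))
print(max_closeness(records, ["age", "zip"], ["disease", "salary"]))
```

In this toy table each equivalence class mirrors the overall distribution of both sensitive attributes, which is the situation the abstract describes as sensitive values being dispersed over the entire data. If all records with the same disease were placed in one class, the class would still be 2-anonymous but the distance to the overall distribution would grow, which is the skewness problem that t-closeness is meant to prevent.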
Keywords/Search Tags: Privacy-preserving data publishing, Data anonymization, Differential privacy, Classification analysis, Cluster analysis