
Research On Privacy-preserving Method For Data Mining

Posted on: 2021-05-01; Degree: Master; Type: Thesis
Country: China; Candidate: X N Li; Full Text: PDF
GTID: 2428330611473211; Subject: Control Science and Engineering
Abstract/Summary:
To prevent the disclosure of individuals' sensitive information, data privacy-preserving techniques have been widely developed in fields such as information transmission, identity authentication, and pattern recognition. In recent years, a variety of privacy models and algorithms have been proposed, among which methods oriented toward data mining have become a research hotspot. This line of research usually applies data anonymization to preserve the integrity and utility of published data while reducing the probability that private information is disclosed. However, simultaneously strengthening privacy and minimizing the information loss of anonymization is an NP-hard problem, and most methods are limited in practice because they consider only traditional data types. Targeting different application scenarios and privacy requirements, this thesis proposes several anonymization methods that aim to maximize data utility in the data mining phase while protecting individuals' sensitive information. The main research contributions are as follows:

(1) Because anonymization readily causes information loss, a privacy-preserving algorithm based on the natural equivalence group (NEG) is proposed for datasets in which one user has multiple records. The information loss that generalization incurs as data dimensionality grows is studied quantitatively, and the NEG is defined from the characteristics of multi-record published data; using the NEG rather than the individual tuple as the unit of anonymization improves anonymization efficiency. A greedy clustering-based anonymization algorithm that operates on NEGs is then proposed. The distance between records is tied to the information loss of generalization, so each partitioning step incurs minimal information loss, which improves the utility of the anonymized published data. In addition to the traditional treatment of numerical attributes, the generalization and information-loss measurement strategies for categorical attributes are clarified. Experimental results show that the algorithm performs well in reducing information loss and improving anonymization efficiency.

(2) Most existing anonymity models and algorithms aim to resist linking attacks on quasi-identifier (QI) attributes and ignore the fact that sensitive attribute values can also form individual fingerprints that attackers can exploit. A bidirectional anonymity model based on k-anonymity and l-diversity is proposed to protect both user identity information and sensitive information; the anonymity parameters for QI attributes and for sensitive attributes can be set separately according to actual requirements. A privacy-preserving algorithm that satisfies this model is then proposed. To further reduce the information loss caused by generalization, the algorithm provides a gradient generalization strategy for sensitive attributes. Experimental results show that it not only strengthens privacy preservation but also maintains good efficiency and data utility.

(3) Traditional anonymization algorithms are mostly designed for relational data; when applied directly to graph data, they remain vulnerable to attacks based on background knowledge of sub-graph structure. A clustering-based anonymization algorithm for social network graph data is therefore proposed. Unlike previous research on relational data, both the users' connection relationships and their attribute information are considered during partitioning. Records with similar structure and attribute values form super nodes, so the anonymized graph data can resist both sub-graph structure attacks and attribute linking attacks. Because social network datasets contain many missing values, unit information entropy is introduced into the attribute distance measure to reduce the data pollution caused by anonymizing missing data. Experimental results show that the algorithm has clear advantages over similar algorithms in clustering quality and in the utility of the anonymized data.
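The abstract builds on k-anonymity, l-diversity, and generalization information loss without stating them concretely. The following minimal Python sketch illustrates the standard textbook definitions only; it is not the thesis's NEG, bidirectional, or graph algorithm. The attribute names, the sample table, the domain bounds, and the normalized-range loss measure are all assumptions made for illustration.

```python
from collections import defaultdict


def equivalence_groups(records, qi_attrs):
    """Group records that share identical (generalized) quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in qi_attrs)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, qi_attrs, k):
    """True if every equivalence group contains at least k records."""
    groups = equivalence_groups(records, qi_attrs)
    return all(len(g) >= k for g in groups.values())


def is_distinct_l_diverse(records, qi_attrs, sensitive_attr, l):
    """True if every group holds at least l distinct sensitive values
    (the 'distinct' variant of l-diversity)."""
    groups = equivalence_groups(records, qi_attrs)
    return all(len({r[sensitive_attr] for r in g}) >= l for g in groups.values())


def numeric_generalization_loss(interval, domain):
    """Normalized width of a generalized interval (assumed loss measure):
    0 means no generalization, 1 means the whole domain."""
    low, high = interval
    dom_low, dom_high = domain
    return (high - low) / (dom_high - dom_low)


# Hypothetical, already-generalized table: ages as intervals, ZIP codes masked.
table = [
    {"age": (20, 29), "zip": "2142*", "disease": "flu"},
    {"age": (20, 29), "zip": "2142*", "disease": "asthma"},
    {"age": (30, 39), "zip": "2143*", "disease": "flu"},
    {"age": (30, 39), "zip": "2143*", "disease": "flu"},
]

k_ok = is_k_anonymous(table, ["age", "zip"], k=2)
l_ok = is_distinct_l_diverse(table, ["age", "zip"], "disease", l=2)
loss = numeric_generalization_loss((20, 29), domain=(0, 100))
```

In this sample, every group has two records, so the table is 2-anonymous, but the second group contains only one distinct disease, so it is not 2-diverse. This is the kind of gap between identity protection and sensitive-value protection that motivates treating QI and sensitive attributes with separate parameters.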
Keywords/Search Tags: privacy-preserving, anonymity algorithm, data publishing, generalization algorithm, social network