
Research On Anonymity Techniques For Privacy-Preserving Data Publishing

Posted on: 2016-12-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xie
Full Text: PDF
GTID: 1318330542474113
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of information technology and the internet, the amount of data collected and analyzed for medical science, economic development, and theoretical research in various fields has been increasing at a remarkable rate. People want to discover the underlying rules and business value in these massive, fast-growing data. However, meeting these demands invariably involves personal privacy, so privacy-preserving methods are of immediate concern to researchers. Data publishing methods and techniques must be studied to better protect personal private information. Published data must balance privacy protection against data utility to meet the requirements of different data users. Furthermore, different types of data face different privacy leakage problems. Therefore, taking into account the differing privacy requirements and data utility needs in data publishing, this dissertation studies anonymization techniques for different types of data.

Firstly, for privacy preservation of categorical attributes, the EMD (earth mover's distance) that t-closeness uses to measure the distance between distributions does not adequately consider the stability between distributions and can hardly measure the distance between them completely. When the stability between distributions is too large, the privacy risk increases greatly. To address these limitations and measure the distance between distributions accurately, a new distance measure that combines EMD with KL divergence is proposed on the basis of traditional t-closeness. In addition, following the hierarchy of the sensitive attributes, the method partitions a table into buckets based on the semantic similarity of sensitive attribute (SA) values, and then uses a greedy algorithm to generate the minimum groups that satisfy the requirement on the distance between distributions. Finally, it adopts the k-nearest neighbour algorithm to choose similar quasi-identifier (QI) values. Experimental results indicate that the SABuk t-closeness model reduces information loss at a small cost in running time, and preserves the privacy of sensitive data well while maintaining high data utility.

Secondly, because privacy-preserving methods designed for categorical sensitive attributes cannot ensure the security of numerical sensitive values, privacy-preserving methods for numerical sensitive data are studied based on the characteristics of such values. Proximity breach is a privacy threat specific to numerical sensitive attributes in data publication. Such a breach occurs when an adversary concludes with high confidence that the sensitive value of a victim individual must fall in a short interval, even though the adversary may have low confidence about the victim's actual value. To address this breach, a model based on proximity breach for numerical sensitive attributes is proposed. First, it divides the numerical sensitive values into several intervals while protecting the internal relations between quasi-identifier attributes and numerical sensitive attributes. Second, it proposes a (k,?)-proximity privacy-preserving principle to defend against proximity breach. Finally, a maximal neighborhood first (MNF) algorithm is designed to realize (k,?)-proximity. The experimental results show that the proposed model preserves the privacy of sensitive data well while keeping high data utility and protecting the internal relations.

Thirdly, for the information leakage problem in publishing data with multiple sensitive attributes, an l-maximum principle based on the traditional l-diversity model is proposed to meet the l-diversity requirements of multiple sensitive attributes. The l-maximum principle controls the frequency with which sensitive values appear in the equivalence classes to reduce the probability of a successful attack, and its security is proved theoretically. To protect the relationships within the data and avoid the attribute leakage problem of the lossy join method, the model partitions attributes by the degree of dependency between them, so that attributes with a higher degree of dependency are placed in the same column. Finally, a multiple sensitive attributes l-maximum algorithm (MSA l-maximum) is proposed. The experimental results show that the proposed model preserves the security of sensitive data while reducing the information hiding rate and keeping high data utility.

Finally, data streams differ from static data sets in being potentially unbounded, fast, and frequently changing, so anonymization methods for static data cannot be applied to them directly and cannot reach satisfactory execution efficiency. To overcome these limitations, a data stream anonymization method based on time density is proposed. The k-medoids method is used to cluster the tuples, and the tuples that satisfy the information loss requirement are output. Considering the strong temporal nature of data streams, time weight and time density are introduced: when the number of published tuples reaches the upper limit, the tuples with the minimum time density are deleted to ensure the reusability of the published tuples. Furthermore, to maintain high efficiency, the algorithm scans the data only once while satisfying the anonymization requirements. The experimental results on a real dataset show that the algorithm is efficient and effective while maintaining the quality of the output data.
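As a rough illustration of the two distribution distances named in the abstract, the following Python sketch computes EMD and KL divergence between the sensitive-value distribution of one group and that of the whole table. It is not the dissertation's method: the abstract does not state how the two measures are combined, so the sketch reports them separately. It also assumes the simplest EMD variant, in which all categorical values are equally far apart, rather than a hierarchy-based ground distance. The toy data and all function names are hypothetical.

```python
import math
from collections import Counter


def distribution(values, domain):
    """Relative frequency of each domain value in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in domain]


def emd_equal_distance(p, q):
    """EMD when every pair of distinct categories has ground distance 1.

    Under this assumption the EMD reduces to half the L1 distance.
    """
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0.

    Assumes q_i > 0 wherever p_i > 0, which holds when p comes from
    a group drawn from the table that defines q.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy data: sensitive values of a whole table and of one group.
table = ["flu", "flu", "cancer", "hiv", "flu", "cancer", "hiv", "flu"]
group = ["flu", "flu", "cancer", "flu"]

domain = sorted(set(table))
q = distribution(table, domain)   # overall distribution
p = distribution(group, domain)   # distribution inside the group

emd = emd_equal_distance(p, q)
kl = kl_divergence(p, q)
print(f"EMD = {emd:.4f}, KL = {kl:.4f}")
```

A group would be acceptable under plain t-closeness when its EMD to the overall distribution does not exceed the threshold t. How the KL term enters the proposed measure is left open here because the abstract does not specify it.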
Keywords/Search Tags:Privacy preserving, Anonymization, Data utility, Static data, Data stream