Font Size: a A A

Research On Differentially Private Data Publication Techniques For High Dimensional Data

Posted on:2018-10-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:N WangFull Text:PDF
GTID:1368330572464583Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of Internet and widespread usage of information technology,the big data era is coming with strong and ever-growing demands on analyzing data.These analyses are crucial and worth billion dollars in the business,which can be used to guide business activities to further improve the user experi-ence.Such analyzed data typically contain sensitive individual information.Hence,it is becoming an urgent issue to publish analysis results without posing privacy.Recently,differential privacy(DP),as the first practical privacy protection model,has attracted a lot of attention,as it can provide rigorous theoretical guarantee,even in the face of attackers with strong background knowledge.Not surprisingly,it is becoming a new research hotspot in the data security domain to design a DP-based algorithm with high utility output.This dissertation discusses how to publish statistics for high dimensional data under DP.These statistics essentially depend on the counts of some columns(dimensions)or column(dimension)patterns.However,a single record in high dimensional data may have a significant impact on the count of each column.In another word,the statistics are greatly sensitive to the variation of one record.That challenges existing works proposed under the assumption of low sensitivity,because now a large amount of noises have to be added into the published values,so as to protect the individual information.The problem this dis-sertation studies thereby is to build a series of new publication strategies for queries on high dimensional data.To the best of our knowledge,most queries basically fall into two categories:count publication and top-k columns publication.Each of them can be further divided,that is,unit count and range count for the former;frequent itemsets and frequent columns for the latter.This dissertation designs different yet important techniques specific to various queries,all of such contributions can be summarized as below.Firstly,a hybrid solution based on the truncating technique and grouping tech-nique is proposed to publish a differentially private histogram,which can be utilized to answer the unit count query on high dimensional data.Prior works have validated that truncating records or grouping columns can individually and effectively improve the accuracy of published results in different scenarios.The new hybrid solution is designed on top of such two techniques for accuracy,by leveraging their advantages.Besides,a smart penalty function with low sensitivity is designed to measure the errors caused by truncating and grouping operations.It contributes to computing more accurate parameters,such as truncating length and grouping size.Moreover,by a two phase selection method,such parameters can be derived efficiently,while,simultaneously,the final results are published with high accuracy.Extensive ex-periments on a broad spectrum of real-world datasets validate the effectiveness of the hybrid method with the two-phase selection technique,when compared with state-of-the-art methods.Secondly,a three-step framework is proposed to publish a differentially private histogram for range count query on high dimensional data.And the framework focuses on sequence data,which follow some dependence and can be modeled by a graph.Sequences might be identified by arbitrary paths in the graph.It poses great challenges to existing differentially private mechanisms because they mainly target at normalized domains with fixed and aligned dimensions.To tackle this problem,the three-step framework publishes data in three steps:1)carefully truncates the original sequences by trading off errors introduced by the truncation with those introduced by the noise added to guarantee privacy,2)decomposes the graph into path sub-domains based on a given group of queries so that some existing techniques can be used,and 3)employs a deeply optimized approach to construct a tree-based histogram for each sub-domain,to enhance the accuracy of range queries.Besides,the consistency enforcement is run for the tree-based histogram,to further boost the accuracy.Finally,we evaluate the performance of all proposals through a series of experiments.Thirdly,a new differentially private mining framework is designed to publish all top-k frequent itemsets.Most existing frameworks directly apply the generic differ-ential privacy mechanisms(laplace mechanism and/or exponential mechanism)into the Apriori framework to mine top-k results.That is simply but makes the privacy budget consumption proportional to k.A large privacy budget consumption means more noises are required and then impairs the accuracy of published itemsets.Hence,reducing the budget consumption is a key research direction.Towards this end,a novel framework called "PrivSuper" is designed and implemented,which contains both a new algorithm and a new differential privacy mechanism.Specifically,Priv-Super directly searches for maximal frequent itemsets,and then immediately adds their sub-itemsets to the final results with cost-free budget consumption.During the search,PrivSuper applies a customized mechanism to extend the current item-set with one more item,termed "sequence exponential mechanism".Notably,this mechanism does not consume any privacy budget,if the current itemset cannot be further extended(i.e.,it is one maximal frequent itemset).Extensive experiments using several real datasets demonstrate that PrivSuper achieves significantly higher result utility compared to previous Apriori-based solutions.Finally,a two-phase selection solution is proposed to publish top-k frequent columns under differential privacy.Existing works directly select frequent columns from the column domain,which is far from ideal due to the large privacy consump-tion or misjudgments for columns with frequencies close to the frequency(called threshold)of the kth frequent column.The new two phase selection solution is presented to carefully choose frequent columns in two different phases.Its main idea is to classify columns into two distinct categories based on whether or not the frequency is close to the threshold.Frequent columns are chosen from the two categories using different techniques,based on the property of columns in different categories.Furthermore,by analyzing the distribution of columns,a privacy free mechanism is designed to partition the columns into blocks dynamically and choose the frequent columns in a block-centric way.Hence,some columns can be judged as frequent without consuming privacy budget.Finally,the privacy free mechanism is proved to satisfy differential privacy theoretically.Extensive experiments on real datasets verify the significant improvement of our proposals on the accuracy.To sum up,this dissertation focuses on differentially private data publication on high dimensional data,including the problems of histogram publications for the unit count query and range count query,and the problems of top-k frequent itemsets and frequent columns publication.Many important and effective techniques are proposed to solve these problems.Theoretical analysis and experiments show our proposals significantly outperform existing up-to-date methods.
Keywords/Search Tags:differential privacy, high dimensional data, histogram publication, count query of the column, top-k
PDF Full Text Request
Related items