Statistical Data Analyses With Local Differential Privacy

Posted on: 2020-05-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S W Wang
Full Text: PDF
GTID: 1368330575466572
Subject: Computer software and theory
Abstract/Summary:
With the penetration of communication networks (e.g., the Internet, 5G networks and IoT) and the popularity of personal devices (e.g., mobile phones and wearable devices), varied and fine-grained personal data of online users are being collected and recorded by service providers for the purpose of mining usage data and improving quality of service. In the meantime, privacy concerns are rising: for example, a user's location data could be abused to infer a home address or daily activities, and a user's usage log might reveal the gender, age and even diseases of an individual. One of the major challenges in the big data era is to preserve the privacy of users while still accomplishing the data analysis tasks of the service provider.

Recently, differential privacy has emerged as a de facto privacy notion in the area of data privacy preservation. Compared to existing privacy models such as k-anonymity and l-diversity, differential privacy is more robust to privacy adversaries with background knowledge, and hence can preserve data privacy more effectively. According to the application scenario, differential privacy can be categorized into three paradigms: centralized differential privacy for database systems; distributed differential privacy for distributed computing scenarios, based on cryptographic tools; and local differential privacy (LDP) for distributed computing scenarios, based on random perturbation of data. Among them, LDP does not rely on trusted parties and has low computational overhead, and hence is applicable to a broad range of areas and domains. Google has provided LDP plugins in the Chrome web browser for privacy-preserving data collection, and Apple has applied LDP to collect typing data on iOS.

Within the definition and framework of LDP, infinitely many mechanisms satisfy LDP, and their performance (e.g., statistical utility, computational complexity and communication complexity) is determined by the design of the mechanism. There have been many research results in the area of data analysis with LDP: theoretical works characterize bounds on statistical data utility in the high privacy regime, and technical works provide several approaches to categorical data distribution estimation. However, as LDP is an emerging area, existing work on its theoretical limits is insufficient. With the growing variety of data types and statistical analysis tasks, existing LDP mechanisms are limited in the application scenarios, data types and statistical analysis tasks they support, and are statistically inefficient in the big data era.

As a remedy, this dissertation focuses on: (1) theoretically analyzing the upper bound of statistical utility in the LDP framework; (2) practically designing effective LDP mechanisms for multiple types of data and diverse statistical analysis tasks. Within the framework of LDP, this dissertation first analyzes statistical utility bounds of LDP from the perspectives of mutual information and distribution estimation, then proposes LDP mechanisms for various types of data (e.g., discrete quantified data, location data and set-valued data) and diverse analysis tasks (e.g., distribution estimation and mean estimation), and also shows their application to mobile device data mining and online A/B testing. Specifically, this dissertation makes the following contributions.

· Analyses of mutual information and distribution estimation error bounds under LDP. Given that existing results on LDP statistical data utility are still coarse and hold only when the privacy level is high, this study takes the mutual information between the true value and its private view as the statistical utility criterion, exploits the symmetry of a random variable without prior knowledge, and gives an exact mutual information upper bound under LDP. Furthermore, this study takes the distribution estimation error as the statistical utility criterion and proposes an optimal mechanism for categorical data distribution estimation.

· Discrete quantified data distribution estimation with LDP. Adopting geo-indistinguishability, an LDP notion tailored to quantified data, we observe that existing approaches emphasize the utility of a single private view. We therefore propose the subset exponential mechanism, which is optimized for statistical utility. Experimental results show that our mechanism reduces the distribution estimation error of discrete quantified data (e.g., ratings, meter readings and location data) by several orders of magnitude.

· Statistical analyses of set-valued data with LDP. Unlike existing works, which split up the set-valued data and the privacy budget, we propose to randomize set-valued data as a whole to make full use of the privacy budget. We provide the optimal parameters of the randomized mapping and the corresponding item distribution estimation error. Theoretical and experimental results demonstrate that our method reduces estimation error by half on average.

· Mean estimation with LDP and its applications to A/B testing. For mean estimation, one of the most common statistical estimation problems, we propose to adaptively discretize the bounded variable according to domain knowledge and the privacy level, then randomize the discretized variable and debias the randomized variable, so that the estimation error is minimized. When the mean estimation method is employed for A/B testing, where the metrics are multi-dimensional and each metric has a different importance, we provide optimal user assignment strategies that improve the sensitivity of hypothesis testing to movements in the metrics.
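To make the "randomize locally, then debias at the collector" workflow concrete, the following is a minimal sketch of k-ary randomized response, a standard baseline for categorical distribution estimation under LDP. It is not the optimal mechanism proposed in the dissertation, and the function names (krr_probabilities, krr_perturb, krr_estimate) and the example parameters are assumptions made purely for illustration. Each user reports the true value with probability p = e^ε / (e^ε + k - 1) and each other value with probability q = 1 / (e^ε + k - 1), so that p / q = e^ε, which is the ratio that ε-LDP permits between any two inputs.

```python
import math
import random
from collections import Counter


def krr_probabilities(epsilon, k):
    """Return (p, q): probability of reporting the true value / any one other value."""
    denom = math.exp(epsilon) + k - 1
    return math.exp(epsilon) / denom, 1.0 / denom


def krr_perturb(value, epsilon, k, rng):
    """Randomize one user's value in {0, ..., k-1} on the user's own device."""
    p, _ = krr_probabilities(epsilon, k)
    if rng.random() < p:
        return value
    # Otherwise report one of the k-1 other values uniformly at random.
    other = rng.randrange(k - 1)
    return other if other < value else other + 1


def krr_estimate(reports, epsilon, k):
    """Debias observed report frequencies into an unbiased distribution estimate."""
    p, q = krr_probabilities(epsilon, k)
    n = len(reports)
    counts = Counter(reports)
    # E[count_j / n] = f_j * p + (1 - f_j) * q, solved for f_j.
    return [(counts[j] / n - q) / (p - q) for j in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    k, epsilon, n = 4, 1.0, 100_000          # illustrative parameters only
    true_dist = [0.5, 0.3, 0.15, 0.05]       # synthetic data, not from the thesis
    data = rng.choices(range(k), weights=true_dist, k=n)
    reports = [krr_perturb(v, epsilon, k, rng) for v in data]
    print(krr_estimate(reports, epsilon, k))
```

The collector only ever sees the perturbed reports. The estimates always sum to 1, but an individual entry can fall slightly below 0 for rare categories, which is one reason mechanism design matters for statistical utility.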
Keywords/Search Tags: Data Privacy, Differential Privacy, Distribution Estimation, Mean Estimation, Categorical Data, Location Data