
A Study Of Statistical Inference On Massive Data Based On Differential Privacy

Posted on: 2024-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J S Song
Full Text: PDF
GTID: 2556306923475434
Subject: Applied statistics

Abstract/Summary:
In recent years, thanks to the development of information technology, industries such as finance, technology, healthcare, education and construction have been actively pursuing digital transformation, which inevitably depends on massive data. Massive data, known as the "fourth paradigm" of scientific research, represents a new milestone in the development of science and technology, bringing great changes to people's lives and exerting a profound influence on social and economic development. At the same time, traditional data processing methods and analysis workflows must be improved in order to extract valuable information from such vast amounts of data. The recent enactment of three laws (the Cyber Security Law, the Data Security Law and the Personal Information Protection Law) reflects China's growing concern about data security, and privacy protection has likewise become a topic of intense interest in academia.

There are various approaches to protecting privacy, such as cryptography, anonymisation and random perturbation. One of the most widely used is the Differential Privacy (DP) mechanism proposed by Dwork et al. in 2006. Built on the idea of random perturbation, it has a rigorous mathematical definition and an axiomatic representation, and it offers a principled trade-off between privacy protection and data accuracy, making data usable while keeping individual records invisible. In recent years, differential privacy has gradually become a de facto standard for privacy protection. Although many algorithms have been proposed and deployed within this framework, statistical inference on privacy-preserved data remains a challenge: traditional statistical analysis methods are difficult to apply because of the uncertainty in the variance of the added noise and in the distribution of the estimators.

In this paper, a new differential privacy-preserving mechanism is proposed in the context of massive data, and a general statistical inference framework, including parametric hypothesis testing and confidence interval estimation, is constructed on this basis. Given the scale of massive data, traditional statistical computation methods inevitably incur excessive computational cost and make the goals of statistical analysis hard to achieve. The Bag of Little Bootstraps (BLB) resampling method produces robust results in statistical inference on large data sets and greatly improves computational efficiency, but it does not take the privacy of the original data into account. Therefore, this paper improves an existing differential privacy algorithm and combines it with the BLB method to obtain a new differential privacy mechanism, enabling statistical analysis of aggregate parameters without exposing individual private data. At the same time, to address both the heterogeneity of the noise variance under the differential privacy mechanism and the uncertainty in the distribution of the estimators, we use the central limit theorem under nonlinear expectation theory to construct the corresponding test statistic and propose a hypothesis testing method. A simulation study demonstrates the good performance of the proposed inference procedure. The massive-data differential privacy mechanism proposed here satisfies privacy protection requirements without compromising subsequent statistical inference, and provides a useful reference for data sharing and statistical analysis.
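The combination of BLB subsampling with differentially private noise described above can be sketched roughly as follows. This is an illustrative sketch only: the function name, the subset-size exponent 0.6, the clipping bound, and the placement of the Laplace noise are assumptions for exposition, not the thesis's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def blb_private_mean(x, eps=1.0, s=20, r=50, clip=5.0):
    """BLB estimate of a mean with per-subset Laplace noise (sketch).

    Data are clipped to [-clip, clip] so that the sensitivity of the
    mean is bounded, which is what calibrates the Laplace noise scale.
    """
    x = np.clip(x, -clip, clip)
    n = len(x)
    b = int(n ** 0.6)  # little-bootstrap subset size, n^gamma with gamma = 0.6
    estimates = []
    for _ in range(s):
        sub = rng.choice(x, size=b, replace=False)
        # r multinomial resamples of nominal size n from each little subset
        boot = [np.mean(rng.choice(sub, size=n, replace=True)) for _ in range(r)]
        est = np.mean(boot)
        # Laplace mechanism: sensitivity of a clipped mean over n points is 2*clip/n
        est += rng.laplace(scale=2 * clip / (n * eps))
        estimates.append(est)
    return float(np.mean(estimates))
```

Each subset releases only a noised aggregate statistic, so no individual record leaves the subset; the final estimate averages the s private aggregates, which is the general shape of combining BLB with a perturbation mechanism.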
Keywords/Search Tags:differential privacy, massive data, BLB algorithm, asymptotic normality, statistical inference