With the rapid development of the Internet,data mining,machine learning,federated learning and other technologies have been widely used.However,in the process of collecting,analyzing and releasing massive data,the sensitive information in the data may be accidentally leaked,and data security and data privacy are facing severe challenges.Differential privacy is a kind of privacy protection technology based on data distortion,and it has strict mathematical theory proof.Not needing to consider the background knowledge of attackers,differential privacy can quantify the level of privacy protection and provide users with strong privacy guarantees.Local differential privacy is a new differential privacy protection model.It transfers the process of data privacy to the user side,which not only provides privacy protection for users but also ensures the information security of the server.Frequent itemset mining is a key technology in data mining,aiming at finding frequent itemsets in transaction data sets.Data mining algorithms with local differential privacy need to take account of security,result utility,communication cost and computing efficiency.This thesis focuses on the frequent itemset mining algorithms with local differential privacy protection,and the research content mainly includes the following three aspects:This thesis analyzes the existing frequency estimation algorithms with local differential privacy and finds that the accuracy of the frequency estimation algorithms with local differential privacy is closely related to data dimension,and the algorithms have the disadvantage of high communication cost.However,due to the limited network bandwidth and user device storage space,frequency estimation algorithms with local differential privacy should maintain a reasonable communication cost.This thesis designs two adaptive frequency estimation algorithms based on the balance of calculation accuracy and communication cost.The adaptive frequency estimation algorithms can reduce communication cost by applying random projection matrix and hadamard matrix,perform random response on low-dimensional data,ensure result utility and be applied to a wider range of scenarios.In order to solve the problem of the low efficiency of frequency estimation algorithms with local differential privacy,this thesis designs two algorithms.Random response is the mainstream perturbation mechanism for local differential privacy.The traditional random response method needs to traverse each coordinate of the vector and generate probability parameters according to the current coordinate value to perform probability flipping.This method has low efficiency.The first algorithm generates multiple random numbers at one time by setting tag variables and performs overall operations on user code vectors,which significantly accelerates the execution speed of the program.The data set involved in frequency estimation with local differential privacy is usually very large,and it is very time-consuming to traverse the entire data set.The second algorithm uses the central limit theorem and standard normal distribution to calculate a reasonable sampling number and samples the original data set.While improving the efficiency of program execution,this algorithm can still ensure a reasonable accuracy of frequency estimation.Frequent itemset mining with local differential privacy needs to estimate the frequency of each itemset in the candidate set,and the itemset whose estimated frequency is higher than the specified threshold is the frequent itemset.The frequent itemset mining process needs to involve multiple sub-query tasks.Therefore,users and the server need to communicate multiple times.In order to make the entire algorithm meet differential privacy,user grouping or privacy budget allocation can be considered.This thesis theoretically proves that when the probability distribution of the user group is consistent with the overall probability distribution,the mean squared error of the frequency estimation algorithm GRR and OUE when users are grouped evenly is smaller than the mean squared error when the privacy budget is allocated evenly.Experimental results show that when the number of users is insufficient or the probability distribution of user groups is not consistent with the overall probability distribution,the privacy budget even allocation is also an effective frequent itemset mining algorithm.This thesis conducts experiments on the synthetic data sets and the real world data sets,analyzes the result utility and calculation efficiency of the algorithms by using such indicators as the mean squared error,normalized cumulative rank and operation time,and verifies the superiority of the designed algorithms. |