In recent years,with the rapid development of data mining,machine learning and deep learning,various industries have accumulated large amounts of data from users.In order to provide users with more personalized service,companies analyzed and processed these data to understand the habits and preferences of users.However,the data generated by users in daily life contains a large amount of personally sensitive information.Direct release or analysis will allow the criminals to collect the user's privacy,thereby performing cyber fraud,telephone fraud,Trojan horse attacks,etc.Differential privacy is a privacy protection mechanism with rigorous mathematical theory.By assuming that the attacker has the greatest background knowledge,adding carefully designed noise during the execution of each step of the algorithm allows the final output to protect the privacy of users,and the privacy protection level can be adjusted by adjusting the size of the privacy budget.At present,differential privacy has been applied to many fields of data mining,such as principal component analysis,clustering,frequent pattern mining,etc.Differential privacy protects privacy by adding noise.The size of the noise is closely related to the dimension of the dataset and directly affects the availability of frequent itemset mining results.Therefore,how to balance security and utility is a challenge faced by frequent itemsets mining algorithms.This thesis proposes two frequent itemsets mining algorithms that satisfy differential privacy protection from two starting points: improving the availability of mining results and improving the efficiency of algorithms.The thesis investigates and analyzes some existing frequent itemsets mining algorithms,and finds that reducing the longest transaction length in the dataset is the key method to improve the availability of mining results of frequent itemsets mining algorithms.How to process the datasets to balance the error introduced by noise and the reduced availability of datasets are the challenges faced by frequent itemsets mining algorithms.This thesis proposes a new differentially private FIM algorithmTrun Super.This algorithm truncates the transaction datasets to reduce the dimension,and sorts the items in decreasing order,then eliminates the items with less support.In this way,it can reduce the information loss of the frequent itemsets.Because frequent itemset mining algorithms that satisfy differential privacy protection need to traverse the dataset multiple times,reducing the time required to traverse a dataset is the key method to improve the efficiency of frequent itemset mining algorithms.The use of sampling datasets will definitely affect the accuracy of mining results,how to use the sampling dataset to improve the efficiency of the algorithm,while ensuring the availability of mining results as much as possible is the challenge faced by frequent itemsets mining algorithms.Aiming at this direction,this thesis proposes a frequent itemset mining algorithm Sample Trun that uses the central limit theorem to calculate the sampling number.The innovation of this algorithm mainly reflects two points.We use the central limit theorem to calculate a reasonable sampling number.And we propose several steps that are most suitable for sampling datasets by the analysis of each step.Finally,the experiments on several real datasets verify the superiority of the two algorithms proposed in the thesis. |