
Research On Random Forest Classification Algorithm Based On Differential Privacy

Posted on: 2020-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
Full Text: PDF
GTID: 2428330578460893
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, the volume of data has grown dramatically, driving the rise of data-driven business models. As the value of data has become more prominent, the problem of privacy leakage has grown with it, and research on privacy protection methods has become a hot topic in academia. Differential privacy is an effective privacy protection method: it protects the original data by adding noise that randomly perturbs the data, and it measures the risk of privacy disclosure quantitatively, so that in practical applications the level of protection can be adjusted dynamically according to requirements.

Building on research into differential privacy, decision trees, and random forest classification algorithms, this thesis addresses two problems: the introduction of excessive noise and high computational overhead. The excessive-noise problem is approached from two angles. First, the way the tree structure is generated is changed: multi-layer subtrees replace individual tree nodes as the unit of generation, which optimizes the use of the privacy budget. Second, the geometric features of the subtree replacement structure are used to dynamically adjust the privacy budget relationship between the upper and lower layers. To address the computational overhead, the thesis uses the M-H sampling method to reduce the scale of computation in the multi-layer subtree replacement algorithm.

The specific work of this thesis is as follows. First, tree nodes are replaced with multi-layer subtrees, and an evaluation function for the subtree replacement structure is designed. Second, an M-H sampling search is used to reduce the computational scale of the multi-layer decision subtree. Third, based on the geometric characteristics of the multi-layer subtree replacement structure, the MLSR-GPB algorithm is proposed to dynamically adjust the privacy budget and optimize the noise. Fourth, the MLSR-DT algorithm is proposed by combining the multi-layer subtree replacement structure with the M-H sampling search and the MLSR-GPB algorithm, and it is tested on the vote and mushroom datasets. The experimental results show that the smaller the height of the generated tree, the higher the classification accuracy; the larger the multi-layer subtree replacement level L, the higher the classification accuracy; classification performs better on the large dataset than on the small one; and using information gain as the evaluation function gives better classification than using the Gini coefficient. Fifth, incorporating the idea of ensemble learning, the MLSR-RF algorithm is proposed, which samples both the dataset and the decision attribute set, further improving classification accuracy and reducing the amount of computation. Compared with the DiffP-ID3 algorithm, classification accuracy improves by 5%-10%. Sixth, to further test the practicality of the MLSR-RF algorithm, it is tested on the breast cancer dataset with good results: classification accuracy stays between 80% and 95%. A security analysis is also given.
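The two ideas the abstract relies on, perturbing counts with noise and dividing a privacy budget between tree layers, can be illustrated with a minimal sketch. The abstract does not give the MLSR-GPB formulas, so the geometric split below (`geometric_budget_split` and its `ratio` parameter) is a hypothetical stand-in and not the thesis's algorithm. The function names, the example counts, and the choice of the Laplace mechanism for leaf counts are likewise assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1): scale = 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def geometric_budget_split(total_epsilon, levels, ratio):
    """Hypothetical allocation: level i gets a share proportional to ratio**i.

    The shares sum to total_epsilon, so by sequential composition the queries
    along one root-to-leaf path stay within the overall budget.
    """
    weights = [ratio ** i for i in range(levels)]
    total_weight = sum(weights)
    return [total_epsilon * w / total_weight for w in weights]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Three levels, each receiving twice the budget of the level above it.
budgets = geometric_budget_split(total_epsilon=1.0, levels=3, ratio=2.0)

# Illustrative (made-up) class counts at one leaf, perturbed with the last level's budget.
leaf_counts = {"yes": 40, "no": 12}
noisy = {label: noisy_count(c, budgets[-1], rng) for label, c in leaf_counts.items()}

# The leaf predicts the class with the largest noisy count.
predicted = max(noisy, key=noisy.get)
```

A smaller epsilon for a level means a larger noise scale for that level, so any allocation scheme trades accuracy at one layer against accuracy at another. With `ratio` above 1, this sketch gives deeper levels more budget; whether MLSR-GPB shifts budget in that direction is not stated in the abstract.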
Keywords/Search Tags: privacy protection, data mining, classification algorithm, MLSR-DT, MLSR-RF