
Feature Selection Based On K-anonymity And Decision Tree Integrated Privacy Protection

Posted on: 2020-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: M Huang
Full Text: PDF
GTID: 2428330590496483
Subject: Software engineering
Abstract/Summary:
Data volumes are growing explosively, and data is both an asset and a privacy risk. A privacy breach during data sharing or publication can cause significant loss and harm to users, businesses, and government agencies. Implementing reasonable privacy protection within the data mining process has therefore become an important problem in privacy-preserving data mining. Privacy can leak at two key points: the data that is publicly shared, and the patterns that mining outputs. This paper studies both aspects, data release and output privacy.

First, for the release of tabular data, privacy protection models and their implementations are studied in depth, with a focus on privacy-preserving feature selection for classification mining. To address the drop in classification performance caused by K-anonymous feature selection, a K-anonymity feature selection algorithm based on random forest feature importance, RFKA, is proposed; it yields a feature subset that satisfies K-anonymity and retains good classification performance. The effectiveness of RFKA is verified by comparison with Greedy_Hamdist, a Hamdist-based K-anonymous feature selection algorithm.

Building on this, and considering that feature selection is an NP-hard problem, GAKAFS, a K-anonymous feature selection algorithm based on an improved genetic algorithm, is designed to optimize feature selection. So that the feature subset both satisfies K-anonymity and classifies well, the chromosomes of the initial population are generated from seeds of superior feature subsets, and each chromosome is strictly checked for K-anonymity. Privacy violation detection is performed during the crossover and mutation phases, and trend detection is added during the mutation phase. Classification experiments using the feature subsets obtained by GAKAFS and RFKA show that, under the same anonymity requirement, the subsets obtained by GAKAFS achieve higher classification performance.

Finally, because a decision tree's mining result contains information about the training data, it can leak private data to some extent. This paper combines K-anonymity with C4.5, including its continuous value processing, and proposes the KAC4.5 algorithm, which ensures that the resulting decision tree satisfies the K-anonymity constraint and does not cause privacy leaks. Classification experiments on decision trees with different K values show that the K-anonymity constraint has little impact on classification results and can even alleviate decision tree overfitting to some extent. In addition, KAC4.5 is compared with the ADT algorithm, which lacks continuous value processing, to verify that KAC4.5 classifies better.
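To make the K-anonymity constraint on a feature subset concrete, the following is a minimal Python sketch, not the thesis's RFKA or GAKAFS implementation. It checks that every combination of values over the selected features occurs in at least k records, then adds features greedily in a given ranking order. The function names, the toy records, and the greedy loop are illustrative assumptions; in RFKA the ranking would come from random forest feature importance, which is simply supplied as a list here.

```python
from collections import Counter


def satisfies_k_anonymity(rows, features, k):
    """True if every value combination over `features` appears in >= k rows."""
    # Project each record onto the chosen features and count identical projections.
    counts = Counter(tuple(row[f] for f in features) for row in rows)
    return all(count >= k for count in counts.values())


def greedy_k_anonymous_subset(rows, ranked_features, k):
    """Hypothetical greedy selection: walk the ranking, keep a feature
    only if the enlarged subset is still k-anonymous."""
    selected = []
    for feature in ranked_features:
        if satisfies_k_anonymity(rows, selected + [feature], k):
            selected.append(feature)
    return selected


# Toy records, invented for illustration only.
ROWS = [
    {"age": "30-39", "zip": "210*", "sex": "F"},
    {"age": "30-39", "zip": "210*", "sex": "M"},
    {"age": "30-39", "zip": "211*", "sex": "F"},
    {"age": "30-39", "zip": "211*", "sex": "M"},
    {"age": "40-49", "zip": "210*", "sex": "F"},
    {"age": "40-49", "zip": "210*", "sex": "M"},
]

# Assumed importance ranking, most important first.
RANKING = ["age", "sex", "zip"]

if __name__ == "__main__":
    print(greedy_k_anonymous_subset(ROWS, RANKING, k=2))
```

In this toy table, adding "sex" to "age" would isolate single records, so the sketch skips it and continues down the ranking. A genetic search such as GAKAFS would explore subsets more broadly than this single greedy pass.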
Keywords/Search Tags: Privacy protection, data mining, feature selection, data publishing, output privacy, decision tree