
Research On Data Mining Algorithms For Privacy Protection

Posted on: 2019-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: N Li
GTID: 2438330545956866
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of informatization, the Internet has reached every corner of society and generates large volumes of real-time data. Extracting valuable information for users from this ever-growing mass of data is an urgent practical problem, which has made data mining a research focus. Data mining has many technical branches, among which classification is the most widely applied and most promising. Among classification algorithms, the decision tree is widely recognized because it requires no background knowledge, is easy to understand, and achieves high classification accuracy.

This paper studies data mining progressively, focusing on decision trees within classification algorithms, and proposes improvements that address existing defects of the C4.5 decision tree algorithm. In the traditional C4.5 algorithm, computing the information gain ratio is expensive and invokes the logarithm function frequently, which seriously reduces execution efficiency. Based on how the information gain ratio is calculated in C4.5, the paper applies the property of equivalent infinitesimals to simplify the computation. Because the discretization of continuous attributes in conventional C4.5 is too rigid and time-consuming, the paper improves it using Fayyad's boundary point theorem. After discretization, a continuous attribute can be treated as an ordinary discrete attribute, so its information gain ratio can also be computed with the simplified method. Since the simplification of information entropy holds only under certain conditions, the paper uses an attribute eigenvalue optimization strategy to compensate for the errors the simplification introduces. On the Weka platform, the improved C4.5 algorithm, J48, and the traditional C4.5 algorithm were compared and evaluated on multiple datasets from the UCI Machine Learning Repository, with accuracy measured by the 10-fold cross-validation method that Weka provides. The experimental results show that the improved C4.5 algorithm is more efficient than the traditional C4.5 algorithm and that its classification accuracy is also improved.

As data mining technology has developed, privacy protection has increasingly drawn public attention, and privacy-preserving data mining has naturally become a major research direction. Building on anonymization techniques for privacy protection, this paper proposes a decision tree model that protects the privacy of the original data set. The model first applies a K-anonymity algorithm to the original data set to produce K-anonymous data, and then classifies that data with the improved C4.5 algorithm. Experiments on the Weka platform indicate that the improved algorithm classifies K-anonymous data better than J48 and the traditional C4.5 algorithm. Meanwhile, the usability and the degree of privacy protection of the model both remain within an acceptable range.
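To make the two C4.5 ingredients named in the abstract concrete, the following is a minimal Python sketch of the standard information gain ratio and of boundary-point candidate thresholds in the sense of Fayyad's result. It is an illustration under stated assumptions, not the thesis's implementation: the function names `entropy`, `gain_ratio`, and `boundary_cut_points` are hypothetical, and the equivalent-infinitesimal simplification is not reproduced because the abstract does not give its formula.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Standard C4.5 gain ratio of a discrete attribute:
    information gain divided by split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after splitting on the attribute
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information is the entropy of the attribute values themselves
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


def boundary_cut_points(values, labels):
    """Candidate thresholds for a continuous attribute, restricted to
    boundary points: midpoints between adjacent distinct values whose
    examples do not all belong to one shared class."""
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    ordered = sorted(classes_at)
    cuts = []
    for lo, hi in zip(ordered, ordered[1:]):
        both_pure = len(classes_at[lo]) == 1 and len(classes_at[hi]) == 1
        if not (both_pure and classes_at[lo] == classes_at[hi]):
            cuts.append((lo + hi) / 2)
    return cuts


# Hypothetical toy data: choose the best binary split of a continuous attribute
ages = [22, 25, 31, 38, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "no"]
cuts = boundary_cut_points(ages, buys)  # [34.5, 48.5]
best = max(cuts, key=lambda t: gain_ratio([a <= t for a in ages], buys))
```

Restricting the search to boundary points means only two of the five possible midpoints are evaluated here, which is the kind of saving the discretization improvement targets. Each evaluated threshold is then scored like an ordinary two-valued discrete attribute.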
Keywords/Search Tags:Data Mining, Decision Tree, C4.5, Weka Platform, Privacy Protection, K-Anonymization