Font Size: a A A

Privacy Preserving Data Mining Based On Rough Set Theory

Posted on:2014-04-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:M Q YeFull Text:PDF
GTID:1268330398479591Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining plays a key role in many database applications, reveals to us the hidden information and data patterns from the normal data. However, when data mining brings us knowledge and profits, it also poses the threat to people’s privacy and the data security. With the strengthening of people’s privacy protection awareness and the establishment of relevant laws and regulations, Privacy Preserving Data Mining (PPDM for short) has become an active area in data mining research. K-anonymity is one of the most important anonymity models to prevent privacy leakage. The principle is to generalize attribute values on quasi-identifiers before data publishing to avoid linking attacks, and thus to achieve the protection of individual privacy. However, data generalization increases the uncertainty of attribute values, and an improper generalization operation will result in unnecessary loss of information, reduce the utility of the anonymous data. Therefore, it is important to select the right granularity level for providing privacy preservation while improving the data utility.Rough set theory is a granular computing model, and can effectively analyze imprecise, inconsistent and incomplete information. One important application of rough sets is attribute reduction and classification rule acquisition in decision tables. However, most of the previous studies on the traditional rough set model and its extensions do not consider decision tables with hierarchical attribute values. Data with hierarchical attribute values are, however, commonly seen in real-world applications, and have been widely used in data warehouses, knowledge discovery at different levels of granularity and the K-anonymity model. It is becoming an important research topic in rough sets to extend the existing theories and approaches of rough sets to deal with hierarchical data. In addition, distributed mining is an area of rapid growth in data mining applications, and privacy preserving attribute reduction in distributed environments is necessary to address important research topic in rough sets.With decision tables with hierarchical attribute values and multi-sources as the research object, this dissertation systematically studies an extension model of rough sets and its application in PPDM. The main contributions can be summarized as follows.(1) To deal with hierarchical data, we extend Pawlak’s rough set model, and propose a novel Multi-Level Rough Set (MLRS for short) based on attribute value taxonomies and full-subtree generalization. MLRS extends the value domains of condition attributes under the original single-level granulation to a domain generalization hierarchy under a multi-level granulation, and extends one equivalence relation on the universe to a nested sequence of equivalence relations. The relationships between Pawlak’s rough set and MLRS are studied, and the nested sequence measures of MLRS are discussed. The results of these studies will be further refined and extended in rough set theory.(2) To select generalization level of attribute value in a decision table, based on attribute reduction in rough set theory, a novel concept of generalization reduction based on MLRS is presented from three perspectives of the rough set theory, such as positive region, information entropy, and knowledge granulation. Generalization reduction can induce the most abstract multi-level decision table with the same classification ability on the raw decision table, and no other multi-level decision table exists that is more abstract. It can avoid over-generalization or under-generalization. Furthermore, the relationships between attribute reduction in Pawlak’s rough set model and generalization reduction in MLRS are discussed, and develop two kinds of positive region heuristic algorithms for computing the generalization reduction which adopt the top-down refinement. Finally, an approach named RMTDR for mining multi-level decision rules is provided. It can avoid the generalization reduction process and mine decision rules from different concept levels. Through standard UCI data sets, experiment results show the validity and effectiveness of the proposed methods. These efforts will further promote the practical application of MLRS in data mining.(3) To protect individual privacy while maintaining the utility of the data in building classification models, we make use of attribute reduction in rough set theory, apply the conditional entropy to measure the classification ability of an anonymous table, and develop k-anonymity for classification analysis from the information view of rough sets. We extend general conditional entropy under single-level granulation to a hierarchical conditional entropy under multi-level granulation, and study its properties by dynamically bottom-up coarsening or top-down refining attribute values. Guided by these properties, we develop an efficient search metric for the tradeoff principle between information gain and anonymity loss, and present a novel algorithm for achieving k-anonymity, Hierarchical Conditional Entropy-based Top-Down Refinement (HCE-TDR). Theoretical analysis and experiments on real world datasets show that our algorithm is efficient and improves data utility. These studies will expand the application of MLRS, and further promote the practical application of the k-anonymity model for PPDM.(4) To solve the privacy preserving attribute reduction problem for multi-source heterogeneous decision tables, we design a secure set intersection cardinality protocol using semi-trusted third party and commutative encryption, develop a novel secure conditional information entropy protocol, and present a vertical privacy-preserving attribute reduction algorithm based on conditional information entropy. This algorithm can achieve co-reduction using attribute reduction based on the information view of rough set theory, which can perform accurate attribute reduction on the premise of no sharing of private information among participants. Also, to solve the privacy preserving attribute reduction problem for multi-source heterogeneous decision tables, we develop a novel secure relative granularity protocol based on secure set intersection cardinality protocol, and propos a vertical privacy-preserving attribute reduction algorithm based on relative granularity; and to solve the privacy preserving attribute reduction problem for multi-source isomorphic decision tables, we develop a novel secure relative granularity protocol based on secure scalar product protocol, and propos a horizontal privacy-preserving attribute reduction algorithm based on relative granularity. Examples demonstrate the feasibility and effectiveness of the proposed algorithms. These studies promote the rough set theory to applications of privacy protection feature selection in a distributed environment.
Keywords/Search Tags:privacy preserving data mining, rough set, k-anonymity, data generalization, reduction, decision rule
PDF Full Text Request
Related items