Data mining technology means to extract hidden, unknown, potentially useful information and knowledge from a large, incomplete, noisy, fuzzy, random data. Since data mining technology widely used, people can easily get hidden knowledge from a large number of data. However, because the mined data may contain a lot of sensitive information, while data mining technology bringing the huge value of knowledge study in information age, it also pose the threat to peopleâ€™s privacy and the data security. Nowadays in the field of commercial, military and public health, etc., people often face with how to protect their privacy under the premise of full cooperation and data sharing within their field. The problem of how to better combine data mining and privacy preserving technologies together is an important subject to solve.Privacy Preserving Data Mining (PPDM) refers to that we can use technical method, such as data perturbation, data reconstruction, encryption, to do the mining job, meanwhile the miner canâ€™t access the original private data under the premise of sufficient precision and accuracy. The aim of PPDM is to change the original data or improve the existing data mining method in order to discover statistical rule or hidden knowledge and rules from the raw data without disclosing any sensitive data to the outside. Nowadays there are many PPDM algorithms around the classification mining, clustering and association rules mining. However, there are few PPDM algorithms which focus on neural network learning and Bayesian networks incremental learning. Besides, the problem of PPDM in distributed environment is more complex than that in central environment. This made the technology of PPDM in central environment canâ€™t be used directly in distributed environment.The research work and our contributions in this thesis mainly consist of the following three aspects:1) Back-propagation neural network learning algorithms are widely used in medical diagnosis, bioinformatics, intrusion detection, homeland security and other fields. The common of these applications is that all of them need to extract patterns and predict trends from a large number of complex data. In these applications, how to protect the privacy of sensitive data and personal information is an important issue. At present, the vast majority of existing neural network learning algorithms did not consider how to protect the data privacy in the process of learning. This thesis proposes two privacy-preserving back-propagation neural network protocols applied to horizontally partitioned data and vertically partitioned data separately. In the construction process of neural networks, our aim is to compute the network weight vector for the training sample set. To ensure the private information of the neural network learning model can not be disclosed, we will assign the weight vector to all participants so that each participant own a part of private values of weight vector. In the calculation of neurons, we use secure dot product protocol, secure multi-party multiplication protocol and secure multi-party addition protocol, thus ensuring the middle and final values of the neural network weight vector are secure and will not be leaked. Finally, the constructed learning model will be securely shared by all participants. And each participant can use the model to predict the corresponding output for their respective target data. The results of experiments show the relationship between the execution time of two protocols and the number of participants and the cryptographic key length. In addition the results of experiments also show the difference of the test error rate between the proposed two protocols and the respective non-privacy preserving versions.2) Bayesian network learning is another research field of machine learning and data mining. In the research of Bayesian network learning algorithm, on the one hand we have to consider the issue of how to avoid the disclosure of data privacy. On the other hand, in the real world applications, data may be gradually available for Bayesian network, thus traditional Bayesian network learning algorithms can not be effectively applied. Incremental learning method can improve the efficiency of the algorithm in the aspect of security, execution time and memory allocation. Therefore, this thesis proposes a privacy-preserving Bayesian network incremental learning algorithm. We use sufficient statistical information in our incremental learning method, so we firstly proposed a formula for calculating the amount of sufficient statistics information. Then we improve the traditional K2 algorithm, add the concept of sufficient statistical information in it, and then propose an incremental K2 algorithm. At last, we proposed a privacy-preserving Bayesian network incremental learning algorithm. Using this algorithm we can calculate the network structure and parameters from the horizontally partitioned and gradually available data. The algorithm only need to keep the sufficient statistical information of each node and its possible parents in order to calculate score function value of each node and its parents, and then construct the Bayesian network structure. Experimental results show the proposed privacy-preserving Bayesian network incremental learning algorithm is more efficient than non-incremental learning algorithms. In addition the experimental results also show the relationship between the execution time of incremental learning algorithm and the number of members in candidate parents list.3) In distributed environment, how to protect data privacy while doing data mining jobs from a large number of distributed data is an important issue. This thesis solves this problem by designing system framework and designing algorithm. In the system framework design, we propose a new distributed framework for frequent pattern mining. In our design each network contains only a ConnectNode responsible for data transmission with other network in order to improve the efficiency of data transmission between different networks. Within the whole framework, only the trusted node can access the database. In the algorithm design aspect, we propose a privacy-preserving frequent pattern mining algorithm in distributed environment. The proposed algorithm does not use the previous method of transferring partitioned database, but to design a method of FP-tree transmission. The method of compressing the data firstly and then transfering the compressed data can improve network transmission efficiency. In this algorithm, computing nodes do not need to access the database, and each node doesnâ€™t exchange data with other nodes, thereby avoiding the disclosure of data privacy. Only trusted node can access the database. Even if the data of a computing node is stolen, this data is not the complete transaction, so this method can minimize the threat of data disclosure. Experimental results show that the proposed algorithm is more efficient than other parallel distributed frequent patterns mining algorithms.As is stated above, this thesis proposed two privacy preserving algorithms for neural network and bayesian network, in order to protect the private information from disclosing in the learning process. Besides, this thesis solved the contradiction between privacy preserving and data mining in distributed environment, and made these two technologies combined together. At last, this thesis proposed a privacy-preserving frequent pattern mining algorithm in distributed environment. This proposed algorithm not only can protect the private information of original data in the process of frequent pattern mining, but also can ensure mining useful rule and pattern in distributed environment. |