With the popularization of the mobile Internet and advances in communication technology, modern terminal devices such as smartphones, wearable devices, and smart sensors are generating large amounts of data at all times. Relying on massive amounts of data and improved hardware computing capabilities, machine learning has entered a golden stage of rapid development and already plays a broad role in people's daily lives. Federated learning allows multiple data holders to train a learning model together, which helps solve the problems of insufficient data quantity and quality that arise when each party learns alone. However, design flaws in machine learning algorithms themselves and the difficulty of understanding the working principles of complex learning models have introduced security risks, such as leakage of private data and deviation of learning results, into existing federated learning frameworks. Therefore, how to improve the security of a federated learning system and maximize the value of massive user data is a problem that urgently needs to be solved. In view of this, this thesis takes the construction of a secure federated learning system as its ultimate goal and starts from the CIA triad of information security: confidentiality, integrity, and availability. Specifically, we conduct research on the confidentiality of the training data, the integrity of the calculation process, and the availability of the learning result. The main content and innovations of this thesis are summarized as follows:

(1) Aiming at the problem that a learning model may leak information about its private training data, this thesis proposes privacy-preserving model training schemes for two typical machine learning algorithms: neural networks and gradient boosting decision trees. For neural networks, this thesis proposes to use differential privacy to perturb the objective function of the learning task so as to blur the training
results and prevent the uploaded results from leaking private information; in addition, it proposes to use differential privacy to randomly select the participants for aggregation in every round, preventing the results from leaking information about the quality and distribution of participant data. For gradient boosting decision trees, exploiting the fact that the construction processes of different subtrees are relatively independent, this thesis proposes a collaborative training method in which different participants train the trees sequentially; it iteratively partitions the training data set and transmits models in parallel to reduce both the differential privacy budget consumption and the communication overhead.

(2) Aiming at the problem that participants in federated learning may obtain improper benefits by falsifying update results, this thesis proposes a sampling-based method for verifying the integrity of the training process: it randomly selects multiple iterations of the training process and uses cryptography-based verifiable computing to ensure, with high probability, that the participant completed the training task. To reduce the computational cost of each verification, this thesis designs a novel method for committing to machine learning models and proves its security theoretically; in addition, to improve efficiency, this thesis proposes multiple optimizations in expressing the machine learning algorithms.

(3) Aiming at the problem that an attacker in federated learning can destroy the function of the learning model by crafting special updates, this thesis proposes a method for detecting abnormal participants based on cross-checking. After receiving the results uploaded by the participants, the server randomly sends these results to multiple other participants for testing and, according to the test results, adjusts the weights of the different participants when the model is aggregated. To avoid detection failure when
the data is not independent and identically distributed, this thesis proposes a method that dynamically adjusts the distribution of detection tasks according to the distribution of the data held by participants; in addition, this thesis also proposes to aggregate multiple uploaded results into a few sub-models and add differential privacy noise, protecting the privacy of participants while reducing the communication overhead.
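As a rough illustration of the server-side aggregation idea in contribution (3), the following Python sketch weights each participant's uploaded update by its cross-check score and adds Gaussian noise before releasing the aggregate. This is not the thesis's actual algorithm: the function name, the score semantics, and the fixed `noise_scale` are assumptions made for this sketch, and calibrating the noise to a formal (epsilon, delta) differential privacy budget is omitted.

```python
import numpy as np

def aggregate_with_cross_check(updates, scores, noise_scale=0.01, rng=None):
    """Weighted averaging of participant updates, perturbed with Gaussian noise.

    updates     : list of np.ndarray, model updates uploaded by participants
    scores      : list of float, e.g. average test accuracy each update achieved
                  when evaluated by other participants (cross-checking)
    noise_scale : std of the Gaussian noise added to the aggregate (hypothetical
                  fixed value; a real DP mechanism would calibrate this)
    """
    rng = rng or np.random.default_rng()
    weights = np.asarray(scores, dtype=float)
    weights = weights / weights.sum()              # normalize cross-check scores
    agg = sum(w * u for w, u in zip(weights, updates))
    return agg + rng.normal(0.0, noise_scale, size=agg.shape)
```

A participant flagged by cross-checking simply receives a low score and therefore a small weight, so its update contributes little to the aggregated model rather than being hard-rejected outright.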