| With the development of big data and artificial intelligence, collecting fragmented data distributed across various fields and extracting valuable information from it through machine learning has become an important means of promoting the digital economy. However, in reality, data owners are often unwilling or afraid to share their data with others. On the one hand, numerous data breaches have raised widespread concerns about data security. On the other hand, increasingly stringent data security and privacy regulations impose stricter requirements on the data mining process. Facing the "data silos" caused by these constraints, there is an urgent need for new technologies that effectively protect users’ raw data while meeting the requirements of the new security regulations. As a new machine learning paradigm that makes data "available but not visible" when multiple parties collaborate to analyze data, federated learning provides a feasible solution to the "data silo" dilemma: data owners can build a global model in a privacy-preserving manner without directly sharing raw data. As an important branch of federated learning, vertical federated learning addresses scenarios in which data are vertically distributed among multiple participants, and offers effective technical support for cross-industry and cross-domain joint data mining. In this thesis, we focus on classification algorithms that are widely used in the real world, and address three problems in current research on vertical federated learning classification algorithms: the poor scalability of vertical federated linear models, the high communication consumption of vertical federated tree models, and the privacy leakage risk of vertical federated neural networks. We conduct in-depth research on these three types of classification algorithms, namely linear models, tree models, and neural networks. The main contributions are as follows: 1. We propose a
vertically federated Softmax regression algorithm supporting multiple participants. As classic machine learning classification algorithms, linear models, especially logistic regression, have been widely used in practice due to their simplicity and efficiency. Although a number of schemes migrate linear models to vertical federated learning, most of them apply only to binary classification with two participants, which cannot meet the widespread real-world need for multi-participant, multi-class classification. To enrich vertical federated linear model algorithms and improve their scalability, we propose a privacy-preserving vertically federated Softmax regression algorithm supporting multiple participants (referred to as the PPVFSR algorithm). First, the CKKS fully homomorphic encryption scheme is used to encrypt the model parameters and residuals so that the gradient is computed over ciphertexts. In addition, a third-party entity is introduced to assist in modeling, ensuring that no party can obtain the data of other parties during the entire training process. Security analysis shows that the PPVFSR algorithm guarantees the privacy of participants’ data when all entities are semi-honest. Experimental results demonstrate that the PPVFSR algorithm achieves accuracy close to that of centralized Softmax regression training. 2. We present a communication-efficient vertical federated CART decision tree algorithm. Tree models are a popular type of machine learning algorithm, and many scholars have attempted to adapt traditional tree models to vertically distributed data and have proposed vertical federated tree model algorithms. However, most existing solutions suffer from either potential privacy leakage or inefficient communication due to the use of secure multi-party computation. To address the trade-off between privacy preservation and communication consumption in existing vertical federated decision tree
algorithms, we propose a communication-efficient vertical federated CART decision tree algorithm (VF-CART for short) based on a mapping technique and homomorphic encryption. First, the feature values are mapped to bin values to construct histograms. Then, a hash function and homomorphic encryption are used to secretly select the optimal split, and the participant who holds the labels cannot obtain the sample subset of each split. In particular, since VF-CART uses feature histograms instead of specific feature values, the number of ciphertexts transmitted between entities in the training and prediction phases is significantly reduced. In the tree construction phase, the participants without labels communicate with the third-party server only once, and in the prediction phase, only one ciphertext is sent to predict a sample. Security analysis and experimental results show that the VF-CART algorithm can significantly reduce communication without disclosing private data. 3. We put forward a privacy-enhanced vertical federated neural network framework. Neural networks are applied extremely widely across fields due to their powerful fitting ability. Most neural network algorithms in vertical federated learning are based on SplitNN, where a complete model is split into two parts assigned to different participants, and only the intermediate data obtained from the split layer are transmitted between them. However, existing research has shown that this mode struggles to resist two attacks: the feature inference attack, in which the participant with labels infers the original data of other participants from the transmitted intermediate data using an auxiliary dataset, and the state-of-the-art label inference attack, the Model Completion (MC) attack, in which a participant with only features obtains the label information through semi-supervised learning on a small amount of auxiliary labeled data. To resist these attacks, we propose a privacy-enhanced vertical federated
neural network framework (PE-VFNN for short). First, to strengthen resistance to the MC attack, the bottom model is replaced with an autoencoder, and the encoded output of the trained autoencoder is used to train the top model owned by the participant who holds the labels. Additionally, to strengthen resistance to the feature inference attack, three defense methods are proposed: reducing the correlation between the encoded result and the original data, adding noise to the encoded result, and adding noise to the original data. Experimental results demonstrate that the PE-VFNN framework significantly improves resistance to the MC attack and provides a strong defense against the feature inference attack. |
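
The histogram step described for VF-CART (mapping feature values to bin values, then choosing a split from the histograms) can be illustrated with a minimal plaintext sketch. This is not the thesis's protocol: the hashing and homomorphic encryption are omitted entirely, equal-width binning is an assumption (the abstract does not state how bins are formed), and the function names `to_bins`, `label_histogram`, `gini`, and `best_split` are hypothetical. The split criterion is the Gini impurity used by CART for classification.

```python
from collections import Counter

def to_bins(values, n_bins):
    """Map raw feature values to equal-width bin indices in [0, n_bins - 1].
    Equal-width binning is an assumption made for this sketch."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def label_histogram(bins, labels, n_bins):
    """hist[b][c] = number of samples in bin b that carry label c."""
    hist = [Counter() for _ in range(n_bins)]
    for b, y in zip(bins, labels):
        hist[b][y] += 1
    return hist

def gini(counts):
    """Gini impurity of a label-count table."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(hist):
    """Scan bin thresholds; samples with bin <= threshold go left.
    Returns (threshold, weighted Gini) for the lowest-impurity split."""
    total = sum(hist, Counter())
    n = sum(total.values())
    left = Counter()
    best = (None, float("inf"))
    for b in range(len(hist) - 1):
        left += hist[b]
        n_left = sum(left.values())
        if n_left == 0 or n_left == n:
            continue  # skip thresholds that leave one side empty
        right = total - left
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[1]:
            best = (b, score)
    return best

# Toy data: one feature held by a party without labels, labels held elsewhere.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
bins = to_bins(values, n_bins=4)
hist = label_histogram(bins, labels, n_bins=4)
split = best_split(hist)
```

Because the split search only reads per-bin counts, the amount of information exchanged depends on the number of bins rather than the number of distinct feature values, which is the intuition behind the reduced ciphertext count claimed for VF-CART.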