With the development of information technology, data science has come to influence all aspects of people's lives, and machine learning and statistical analysis have gained the public's attention. In machine learning, factors such as the massive volume of data and the intensive computation involved make learning tasks intractable for a single computer. For these reasons, distributed systems are gaining attention in the fields of machine learning and statistical querying. Existing distributed machine learning models and frameworks, however, lack an analysis of how privacy constraints and communication constraints affect model complexity.

This thesis studies the amount of data required for machine learning in distributed systems to reach a given accuracy benchmark under differential privacy constraints and communication constraints, and compares the results under different settings of the privacy or communication constraints, the topology of the system, and the sizes of the nodes' datasets. The statistical query (SQ) algorithm is the key method used in this thesis. An SQ algorithm does not allow the learner to access the raw data directly; the learner may only issue statistical queries, and a responder called an oracle answers each query with an approximation of its expected value. Since the computational models of differential privacy constraints and communication constraints can both be formalized as adding noise to computational results, statistical query learning can perform machine learning tasks in these scenarios, and the sample complexity of statistical query learning is already well understood in theory. In addition, a statistical query can be viewed as the average of a query function over the entire instance domain and can be computed in parallel on multiple worker nodes, so the algorithm can be implemented in a distributed framework. Using statistical query learning to analyze distributed machine learning under differential privacy constraints is therefore an appropriate and effective choice.

To address data allocation for distributed machine learning under differential privacy or communication constraints, three star-topology distributed learning models are constructed, subject to differential privacy constraints, communication constraints, or both, and algorithms for answering statistical queries under each model are designed. Combining these with the sample complexity theory of statistical query learning, this thesis obtains the sample complexity of star-topology distributed machine learning under the three restrictions. In contrast to the traditional local differential privacy model, where each node holds only one sample, our models allow each worker node to hold multiple samples. Before performing the analysis, this thesis proves the privacy guarantee of the model in which both constraints exist. We prove that sums of Laplacian noises or of uniformly distributed noises concentrate near the expected value of the noise, and that the concentration range shrinks as the number of noise terms increases. The final analysis shows that under differential privacy constraints or communication constraints, the more data a single node holds, the less noise needs to be added to protect privacy, and the required number of nodes becomes exponentially smaller.
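As a rough illustration of the star-topology mechanism described above, the following Python sketch shows one way n worker nodes, each holding m local samples, might answer a single statistical query under a per-node differential privacy constraint. The function names, the assumption that the query takes values in [0, 1], and the noise calibration are illustrative choices rather than the thesis's exact algorithm; the point is that the Laplace scale 1/(εm) shrinks as the per-node sample count m grows.

```python
import numpy as np

def answer_sq_star(node_datasets, query, epsilon):
    """Answer one statistical query in a star topology under a per-node
    epsilon-differential-privacy constraint (illustrative sketch only)."""
    noisy_reports = []
    for data in node_datasets:
        m = len(data)
        local_avg = np.mean([query(x) for x in data])
        # With query values in [0, 1], changing one of the m local samples
        # moves the local average by at most 1/m, so Laplace noise of scale
        # 1/(epsilon * m) suffices: more data per node means less noise.
        noisy_reports.append(local_avg + np.random.laplace(scale=1.0 / (epsilon * m)))
    # The central server averages the noisy reports; the independent Laplace
    # noises average out, which is where the concentration result for sums
    # of Laplace noises enters the analysis.
    return float(np.mean(noisy_reports))

# Example: estimate Pr[x > 0] with 20 nodes holding 500 samples each.
rng = np.random.default_rng(0)
nodes = [rng.normal(size=500) for _ in range(20)]
print(answer_sq_star(nodes, lambda x: float(x > 0), epsilon=0.5))
```

Averaging the noisy node reports at the center averages the independent noise terms as well, which is why the accuracy of the answer improves both with more samples per node and with more nodes.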
As the amount of data held by the nodes changes, the number of nodes required to answer a statistical query changes as well, and when multiple queries arrive, a different number of worker nodes must be allocated to each statistical query.

To analyze the topology of distributed machine learning under differential privacy or communication constraints, we consider the impact of different amounts of data per node and of different system topologies on communication traffic, sample complexity, and time complexity. We first relax the assumption that every node uses an equal amount of data to train the machine learning algorithm: we consider the situation where each node holds a different amount of data and analyze the sample complexity required in that case. The result shows that the range of data volumes across the nodes affects the sample complexity. When the range is large, the lower bound on the number of nodes is mainly determined by the average amount of data held by a node; when the range is small, the bound is mainly determined by the minimum amount of data held by a node. In addition, this thesis builds distributed learning models over different topologies (linear sequential propagation, bus, ring, and tree topologies), designs the corresponding algorithm for each topology, proves the privacy guarantees of these algorithms, and analyzes their sample complexity. The different topologies are then compared and summarized. Compared with the star topology, the linear sequential propagation, ring, and tree topologies can reduce the added noise while still meeting the differential privacy constraints. The bus topology requires less communication traffic to answer a statistical query, but the bus shared by multiple nodes is prone to conflicts. Multi-segment ring and tree topologies effectively shorten the time required for machine learning.
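To illustrate why tree-shaped aggregation shortens learning time, the sketch below (hypothetical code, not taken from the thesis) merges per-node noisy reports pairwise, so n reports are combined in about log2(n) parallel rounds rather than the n - 1 sequential steps a linear chain would need.

```python
def aggregate_tree(noisy_reports):
    """Combine per-node noisy reports by pairwise (binary-tree) summation.

    Each round halves the number of active partial sums, so n reports are
    merged in ceil(log2(n)) parallel rounds instead of n - 1 sequential
    additions, which is the source of the tree topology's time savings.
    """
    values = list(noisy_reports)
    rounds = 0
    while len(values) > 1:
        # One round: disjoint pairs of nodes combine partial sums in parallel.
        values = ([values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
                  + ([values[-1]] if len(values) % 2 else []))
        rounds += 1
    return values[0], rounds

total, rounds = aggregate_tree([0.1, 0.2, 0.15, 0.25, 0.3])
print(total, rounds)  # sum of the 5 reports, merged in ceil(log2(5)) = 3 rounds
```

The same pairwise structure generalizes to k-ary trees, trading fan-in per round against the number of rounds.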