In recent years, with the rapid development of IoT and mobile computing technology, the number of intelligent devices at the network edge has grown rapidly, leading to explosive growth in the scale of generated data. To improve the quality of various applications and services, the massive data generated by IoT devices must be analyzed efficiently with Artificial Intelligence (AI) technology, for example through model training and inference. In traditional cloud computing, massive data must be sent over network protocols such as 4G, 5G, and WiFi to remote cloud data centers for storage, processing, analysis, and other operations. Although cloud computing has achieved significant results in data processing and training, as the volume of data grows, aggregating all data in the cloud is no longer practical. The transmission and computation of massive data place heavy loads on the core network and data centers, resulting in high network latency and long response times. This not only reduces the efficiency of model training and inference and degrades the user experience, but also poses a risk of user privacy leakage. To satisfy the rapid response and decision-making requirements of intelligent applications, edge computing technology has been introduced. By establishing storage, computing, and networking platforms close to the network edge, some cloud data center services are deployed at the edge, promoting cooperative analysis and processing of data from both the edge and the cloud. This paradigm reduces the burden on the core network and data centers while providing users with a fast and secure service experience. By combining the powerful computing and communication resources of cloud computing with the short transmission and response times of edge computing, the value of edge-cloud collaboration can be maximized. However, in the edge-cloud collaborative architecture, especially when conducting distributed model training (also known as federated learning) at
the edge, limited resources, heterogeneous devices, dynamic environments, and non-IID data are often encountered, which poses enormous challenges to efficient model training and inference. At the same time, as the number of edge devices continues to increase, achieving efficient edge-node control becomes another challenge in large-scale edge-cloud collaborative systems. To address these challenges, this dissertation studies distributed model training based on edge-cloud collaboration, with the main research content and contributions as follows.

(1) This dissertation proposes a joint decision-making algorithm for the distributed model training task offloading problem in the edge-cloud collaborative framework, aiming to jointly optimize data collection, service placement, and resource allocation. Given the resource constraints and heterogeneity of edge servers, performing distributed model training on them faces three major challenges: how to collect training data from multiple data nodes, how to place training services on edge nodes, and how to allocate resources among multiple tasks executing on the same edge node. To maximize system throughput while ensuring quality of service (QoS), this dissertation considers these three challenges jointly. Specifically, the joint optimization of data collection, service placement, and resource allocation is modeled as an NP-hard mixed-integer nonlinear programming problem, and an approximation algorithm based on a filtering-and-rounding method is proposed. The analysis shows that in most cases the algorithm achieves an approximation ratio of O(1/(log n/α + 1)), where α > 1. Extensive simulation experiments verify the effectiveness of the proposed algorithm: compared with traditional algorithms, it increases system throughput by 67%.

(2) This dissertation proposes a semi-supervised federated learning framework
based on progressive training for model structure optimization in the edge-cloud collaborative framework. Existing semi-supervised federated learning methods usually train models with a fixed structure, which may lead to two main problems. First, shallow models may not fit the increasing amount of pseudo-labeled data well. Second, large models may overfit when trained on a small number of labeled samples. Building on previous work, this dissertation further considers the challenges of resource limitation and system heterogeneity, and proposes a novel progressive-training semi-supervised federated learning framework (FETA). Specifically, FETA gradually increases the model depth by adding sub-modules to the shallow model, and sets a confidence threshold to generate high-quality pseudo labels for unlabeled data. A multi-armed bandit based algorithm is then proposed to determine the appropriate model depth and the proper pseudo-label confidence threshold for each edge node. Experiments on three benchmark datasets show that, compared with the baseline algorithms, FETA achieves approximately a 10% improvement in test accuracy and reduces global communication bandwidth consumption by 40%.

(3) This dissertation proposes an asynchronous model training mechanism that combines neighbor selection and gradient pushing to address the communication bottleneck of global model updating in the edge-cloud collaborative framework. Considering the characteristics of limited resources, heterogeneous devices, dynamic environments, and non-IID data, and to counter the potential loss of model accuracy caused by non-IID data as well as the impact of edge node heterogeneity on training efficiency, this dissertation proposes an asynchronous distributed model training mechanism (AsyNG). Specifically, each edge node pushes its local model only to its optimal neighbors to improve resource utilization. To dynamically select neighbors for global updating in each
iteration, this dissertation designs a priority-based algorithm that balances communication cost and training performance, grounded in a theoretical analysis of the convergence of AsyNG in non-IID and heterogeneous scenarios. Extensive experiments on a testbed show that, compared with existing asynchronous algorithms, AsyNG reduces communication cost by 60% while achieving the same test accuracy, and shortens completion time by about 30%.

(4) This dissertation proposes a VPC-based dual-path message delivery system that combines an end-to-end path and a message-queue path for edge-cloud collaborative model training systems. To facilitate edge-cloud collaborative communication, the virtual machines in the data center and the edge nodes are typically placed in the same virtual private cloud (VPC). However, as the number and scale of virtual private clouds increase, efficient transmission of control messages from the cloud control plane to the edge node data plane becomes a critical factor for system scalability. Existing end-to-end transmission schemes may incur significant control-plane overhead, while message-queue schemes may incur high data-plane overhead. Therefore, this dissertation designs a high-performance control message delivery system (Meteor) that minimizes message transmission latency using an automatic switching mechanism between the RPC path and the message-queue path. The modular message delivery system is extensively tested with up to 100k container instances. Compared to state-of-the-art solutions, Meteor reduces message transmission latency by 50%.
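The filtering-and-rounding algorithm of contribution (1) is not detailed in this abstract; as a stand-in, the sketch below only illustrates the shape of the underlying admission decision with a simple greedy heuristic: admit training tasks in decreasing throughput-per-resource order while edge-node capacity lasts. The task data, names, and capacity value are all made up, and this greedy rule is a deliberate simplification, not the dissertation's approximation algorithm.

```python
def greedy_admit(tasks, capacity):
    """Greedily admit tasks on one edge node.

    tasks: list of (name, throughput, resource_demand) tuples.
    Tasks are considered in decreasing throughput-per-resource order,
    and a task is admitted only if remaining capacity allows it.
    """
    admitted, used = [], 0.0
    for name, thr, demand in sorted(tasks, key=lambda t: t[1] / t[2], reverse=True):
        if used + demand <= capacity:   # place the task only if resources remain
            admitted.append(name)
            used += demand
    return admitted

# Hypothetical tasks: (name, achievable throughput, resource demand).
tasks = [("t1", 10.0, 4.0), ("t2", 6.0, 1.0), ("t3", 9.0, 3.0), ("t4", 4.0, 4.0)]
print(greedy_admit(tasks, capacity=8.0))
```

The real joint problem also couples data collection and service placement across nodes, which is what makes it an NP-hard mixed-integer program rather than a single-node selection.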
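The multi-armed bandit step of FETA in contribution (2) can be sketched generically: each arm is a (model depth, pseudo-label confidence threshold) pair, and a standard UCB1 rule trades off exploring configurations against exploiting the best one seen so far. The arm set, the toy reward standing in for validation accuracy, and the round count below are all illustrative assumptions, not FETA's actual parameters.

```python
import math
import random

# Hypothetical arm set: (model depth, confidence threshold) pairs.
ARMS = [(depth, tau) for depth in (2, 4, 6) for tau in (0.8, 0.9, 0.95)]

def ucb1_select(counts, sums, t):
    """Pick the arm maximizing empirical mean plus a UCB1 exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:                      # play every arm once first
            return i
    return max(range(len(ARMS)),
               key=lambda i: sums[i] / counts[i]
                             + math.sqrt(2 * math.log(t) / counts[i]))

def run_bandit(reward_fn, rounds=2000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(ARMS)
    sums = [0.0] * len(ARMS)
    for t in range(1, rounds + 1):
        i = ucb1_select(counts, sums, t)
        counts[i] += 1
        sums[i] += reward_fn(ARMS[i], rng)
    # Report the arm with the best empirical mean reward.
    return max(range(len(ARMS)), key=lambda i: sums[i] / counts[i])

def toy_reward(arm, rng):
    # Noisy stand-in for validation accuracy: peaks at depth 4, threshold 0.9.
    depth, tau = arm
    return 1.0 - 0.1 * abs(depth - 4) - abs(tau - 0.9) + rng.gauss(0, 0.01)

best = run_bandit(toy_reward)
print(ARMS[best])
```

In the FETA setting the reward would come from actual training feedback on each edge node rather than a closed-form function.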
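The priority-based neighbor selection of AsyNG in contribution (3) can likewise be sketched: each edge node scores its neighbors by a priority that trades off expected training benefit against communication cost, then pushes its local model only to the top-k. The score weights, the staleness/divergence/link-cost features, and the peer data below are illustrative assumptions, not the dissertation's actual priority function.

```python
from dataclasses import dataclass

@dataclass
class Neighbor:
    name: str
    staleness: int      # rounds since this neighbor last received an update
    divergence: float   # dissimilarity between local and neighbor data (0..1)
    link_cost: float    # relative communication cost of the link (0..1)

def select_neighbors(neighbors, k=2, w_stale=0.5, w_div=0.3, w_cost=0.2):
    """Return the k neighbors with the highest priority score."""
    def priority(n):
        # Prefer stale, dissimilar neighbors reachable over cheap links.
        return (w_stale * min(n.staleness, 10) / 10
                + w_div * n.divergence
                - w_cost * n.link_cost)
    return sorted(neighbors, key=priority, reverse=True)[:k]

peers = [
    Neighbor("edge-a", staleness=8, divergence=0.7, link_cost=0.2),
    Neighbor("edge-b", staleness=1, divergence=0.9, link_cost=0.1),
    Neighbor("edge-c", staleness=9, divergence=0.2, link_cost=0.9),
    Neighbor("edge-d", staleness=3, divergence=0.3, link_cost=0.4),
]
chosen = select_neighbors(peers)
print([n.name for n in chosen])
```

Rewarding dissimilar (non-IID) neighbors in the score is what lets such a rule mitigate accuracy loss from skewed local data while still capping communication.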
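Meteor's dual-path switching in contribution (4) can be illustrated with a minimal cost rule: the direct RPC path contacts each target individually, so its cost grows with fan-out, while the message-queue path pays a roughly constant publish cost. The latency numbers and the linear cost model below are illustrative assumptions, not Meteor's actual switching policy.

```python
def choose_path(fanout, rpc_latency_ms, queue_latency_ms):
    """Pick the delivery path with the lower estimated total latency.

    fanout: number of edge-node targets for one control message.
    rpc_latency_ms: per-target latency over the direct RPC path.
    queue_latency_ms: publish latency of the message-queue path,
    after which brokers fan the message out.
    """
    rpc_cost = rpc_latency_ms * fanout
    queue_cost = queue_latency_ms
    return "rpc" if rpc_cost <= queue_cost else "queue"

# Unicast control messages favor RPC; large broadcasts favor the queue.
print(choose_path(fanout=1, rpc_latency_ms=2.0, queue_latency_ms=15.0))
print(choose_path(fanout=100, rpc_latency_ms=2.0, queue_latency_ms=15.0))
```

An automatic switch of this shape avoids the control-plane overhead of pure end-to-end delivery and the data-plane overhead of routing every message through the queue.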