Research On Key Technologies Of Privacy Preserving-oriented Modeling And Debugging In Federated Learning

Posted on: 2024-01-15  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S M Duan  Full Text: PDF
GTID: 1528307376985099  Subject: Computer application technology
Abstract/Summary:
Data are the new oil of the era, containing immense value. However, data security and privacy protection are also a "red line" that must be respected. Finding a way to reconcile privacy protection with data value mining has therefore become an urgent need. Federated learning has emerged as a promising technology in academia and industry because it enables multiple participants to train machine learning models collaboratively without sharing local data. However, when training problems degrade the performance of a federated learning model, debugging the federated learning task and improving model performance while protecting data privacy pose significant technical challenges, because the training data and training processes cannot be accessed in federated learning.

A typical workflow in federated learning involves problem definition, simulation prototyping, federated training, model evaluation, and deployment. This paper analyzes the main problems in the modeling and training process of federated learning from the perspective of this modeling workflow. First, during the simulation prototyping stage, global data cannot be accessed and explored, making it difficult to design a suitable machine learning workflow. Second, in the federated training process, label noise and non-independent and identically distributed (Non-IID) data can prevent the model from converging or cause it to converge to a poor state. Finally, training problems may lead to model performance degradation that is difficult to diagnose and debug during federated training. To address these problems and challenges, this paper focuses on the following research contents and contributions:

(1) A federated tabular data synthesis method was proposed to address the mode collapse issue caused by multiple modes in continuous columns and category imbalance in discrete columns in decentralized tabular data. To tackle the problem of multiple modes in continuous columns, we proposed a multimodal distribution normalization method that learns the multimodal distributions from the decentralized continuous columns and encodes them as prior knowledge into the training data. A federated conditional sampling method was also designed to rebalance the categories during the training process, ensuring that each category receives sufficient training. To prevent private data leakage during data synthesis, we presented a privacy consumption-based federated generative model. The experimental results demonstrated that our method achieved the best trade-off between data utility and privacy level. Specifically, in terms of data utility, the tables generated by our method were the most statistically similar to the original tables, and the evaluation scores of our method outperformed the state-of-the-art model in machine learning tasks.

(2) A data filtering-based federated label noise debugging method was proposed to collect clean data for training, without relying on an additional clean dataset or a robust initial model for data or participant selection. The main idea of our method is to identify clean data by the correlation between global data. First, we proposed a privacy-preserving data representation transformation algorithm to convert private data into a shareable privacy-preserving data representation. We theoretically proved that the proposed algorithm not only protects data privacy but also preserves the correlation between the original data. Second, we designed a k-nearest neighbor graph-based data filtering method to identify clean data for training on the centralized privacy-preserving data representations. The collected clean data were used for federated learning training to eliminate the influence of label noise data on the model. The evaluation results showed that our method outperformed the state-of-the-art approaches and was robust to various data distributions and noise levels.

(3) A data augmentation-based Non-IID data debugging method was proposed to address the low performance, privacy protection concerns, and high communication overhead faced by existing methods based on federated generative models or raw data sharing strategies in decentralized tabular data. Unlike existing methods that share generative models or raw data, our method mainly shares low-dimensional statistical information components, including multimodal distributions in continuous columns, cumulative distributions in discrete columns, and global covariance. Based on this statistical information, we proposed multimodal distribution transformation and inverse cumulative distribution mapping to synthesize decentralized tabular data for data augmentation and convert Non-IID data into IID data, thereby eliminating the influence of Non-IID data on the model. We theoretically proved that the proposed method preserves both data privacy and the original data distribution. Experimental results showed the superiority of our method over the state-of-the-art methods in terms of test performance, statistical similarity, and communication efficiency.

(4) A differential feature-based automated debugging method was proposed to diagnose and debug model training issues during the federated training stage, where data and training processes are inaccessible. To avoid privacy leakage resulting from manual analysis of the federated training process, we proposed an automated metadata collection and model diagnosis method. This method obtains the training data flow graph by parsing the training script and locates the target metadata in the data flow graph based on deep learning semantics and syntax. The process of metadata collection, analysis, and model diagnosis is entirely transparent to data scientists. Additionally, we designed a differential feature-based model repair method to retrain the model with error weights repaired. Based on the model diagnosis results and the collected metadata, this method selects samples that can repair the error weights of the model for retraining, thereby improving model performance. The experimental results showed that our method achieved the best results in terms of model debugging efficiency and performance improvement compared to existing methods.
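
As a rough illustration of the nearest neighbor graph-based data filtering idea in contribution (2), the following minimal Python sketch keeps a sample only when most of its nearest neighbors share its label. It is not the dissertation's algorithm: the function name `knn_label_filter`, the Euclidean distance, the majority-agreement rule, and the parameters `k` and `agree_threshold` are all assumptions made for illustration. The input features stand in for the shared privacy-preserving representations, and the transformation that produces them is not shown.

```python
import numpy as np


def knn_label_filter(features, labels, k=3, agree_threshold=0.5):
    """Hypothetical filter: mark a sample as clean when more than
    `agree_threshold` of its k nearest neighbors carry the same label."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # Pairwise squared Euclidean distances between all samples.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of every sample (the k-NN graph).
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]

    # Fraction of neighbors whose label matches the sample's own label.
    agreement = (labels[neighbor_idx] == labels[:, None]).mean(axis=1)
    return agreement > agree_threshold


# Toy data (assumed): two well-separated clusters, one flipped label.
cluster_a = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8]])
cluster_b = cluster_a + 10.0
points = np.vstack([cluster_a, cluster_b])
noisy_labels = np.array([0] * 6 + [1] * 6)
noisy_labels[0] = 1  # inject label noise into the first sample

clean_mask = knn_label_filter(points, noisy_labels, k=3)
clean_points, clean_labels = points[clean_mask], noisy_labels[clean_mask]
```

In this toy setting, only the sample with the flipped label disagrees with its neighbors, so it is the only one excluded from the data kept for training.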
Keywords/Search Tags: Federated learning, Federated data synthesis, Federated label noise, Non-IID, Federated learning model debugging