
Efficient Task-oriented Quality Assessment For Large-scale Datasets

Posted on: 2022-02-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A R Li
GTID: 1488306323463634
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of artificial intelligence and Internet of Things technologies, data volumes are growing rapidly, and data quality problems have become prominent. During data collection, transmission, and processing, quality problems such as unlabeled or noisy samples can have serious consequences for decision-making, so data quality assessment has become essential. In centralized learning, existing quality assessment work focuses on structured data and on individual data pieces rather than whole datasets, although datasets are what learning tasks more commonly use. Simply averaging the quality of data pieces, while neglecting the relationships among them and the fact that a dataset has different quality for different tasks, fails to capture the characteristics of a dataset. In addition, although existing quality assessment models provide good quantitative measurements, their computation overhead of nearly a second makes them difficult to apply to large datasets. Finally, federated learning (FL) involves unbalanced distributed data and low-quality data, and privately evaluating the quality of local samples under resource constraints is a challenging problem. This work therefore focuses on data quality assessment and high-quality sample selection for many types of data and many task models, in both centralized and federated deep learning scenarios. The contributions of this work are summarized as follows.

1. This work proposes a task-oriented data quality assessment and sample selection framework for centralized tasks. Existing data quality assessment work has the following drawbacks. First, most previous attempts focus on assessing the intrinsic quality of data, while few consider contextual factors such as the target task or service, which require the data content to be relevant to the task. Contextual factors have been shown to affect data quality significantly. Second, existing work mainly measures the quality of individual data pieces rather than that of a whole dataset, although whole datasets are more commonly used by present services. In addition, the relevant quality dimensions differ across data types and application scenarios. To this end, we propose a task-oriented quality evaluation system that comprehensively and efficiently measures the quality of datasets for a given task. Specifically, given multiple datasets, the quality evaluator evaluates their intrinsic quality, extracts content-based features, and conducts contextual quality assessment. Based on the evaluation results, the evaluator then conducts rank aggregation to obtain a ranking of the datasets by quality: the higher a dataset ranks, the higher its quality for the task. The evaluator can then select high-quality datasets for model training according to the quality ranking list.

2. This work proposes a data quality assessment system for FL tasks. In FL systems, many participants may possess erroneous data, which seriously hinders the global model from achieving good performance. To improve the performance of an FL system, we propose an efficient data quality assessment framework based on influence functions, which identifies erroneous samples that negatively affect the FL model and then fixes the errors to improve the global model. The system consists of two main steps: hierarchical influence analysis, and influence-based client selection and model updating. First, we theoretically analyze the method for identifying influential clients, which is based on the training log of the model, and the methods for identifying influential samples, which are based on influence functions. Based on these theoretical analyses, we design a hierarchical influence approach to identify negatively influential clients and erroneous samples. Finally, we propose an influence-based method that dynamically selects clients to participate in model training, yielding a global model with faster convergence and higher test accuracy.

3. This work proposes a sample-level data selection system for FL tasks. Obtaining large, high-quality datasets has become a common bottleneck for many machine learning models and AI applications, for the following reasons. 1) It is very expensive to collect and label massive numbers of samples. 2) The existence of erroneous samples makes it extremely difficult to distinguish qualified samples from erroneous ones. In FL, there is a collection of data owners who are willing to participate in FL tasks for a certain price. The server aims to select a set of high-quality samples for the target task and to pay the owners who participate in model training, within the budget and in a privacy-preserving way. The system consists of the following main steps: relevant client filtering, client selection before training, dynamic client and sample selection, and model training. When an FL task arrives, the server uses a private set intersection (PSI) method to filter relevant clients, and then privately selects high-quality clients to maximize statistical homogeneity and content diversity under the budget constraint, using an algorithm based on the determinantal point process (DPP). The server then coordinates the selected clients to participate in FL training. To further improve model performance and reduce training overhead, in each training epoch the server uses importance sampling to select a certain proportion of important clients, and their important samples, to participate in model training, thereby obtaining a high-performance global model.
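
The per-epoch selection of important clients and their important samples can be illustrated with a minimal Python sketch. The abstract does not say how importance is scored or how many items are drawn, so everything here is an assumption made for illustration: the function names `importance_sample` and `select_round`, the score lists (which might come from per-sample losses or gradient norms), and the parameters `client_frac` and `samples_per_client`. It shows generic importance sampling with inverse-probability weights, not the dissertation's actual algorithm.

```python
import random


def importance_sample(scores, k, rng):
    """Draw k indices with replacement, with probability proportional to score.

    Returns (index, weight) pairs, where weight = 1 / (n * p_i) corrects for
    the non-uniform sampling so that weighted averages stay unbiased.
    """
    total = sum(scores)
    if total <= 0 or any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative with a positive sum")
    n = len(scores)
    probs = [s / total for s in scores]
    picks = rng.choices(range(n), weights=probs, k=k)
    return [(i, 1.0 / (n * probs[i])) for i in picks]


def select_round(client_scores, sample_scores, client_frac, samples_per_client, rng):
    """Pick a fraction of clients, then important samples within each pick.

    client_scores: hypothetical importance score per client.
    sample_scores: hypothetical per-sample scores, one list per client.
    """
    k = max(1, round(client_frac * len(client_scores)))
    plan = {}
    for cid, _ in importance_sample(client_scores, k, rng):
        # A client drawn more than once is scheduled only once in this sketch.
        if cid not in plan:
            plan[cid] = importance_sample(sample_scores[cid], samples_per_client, rng)
    return plan
```

In a training loop, each selected sample's loss would be multiplied by its weight before averaging, which offsets the bias introduced by drawing important samples more often.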
Keywords: Data Quality Assessment, Deep Learning, Federated Learning, Influence Function, Importance Sampling